Datasets

Datasets are a great option for reducing the storage requirements of jobs as well as reusing data across many jobs. Public datasets are completely free to use, and private datasets only incur storage charges for their own size, no matter how many jobs they are used on.

Public Datasets

Public datasets are a collection of popular public domain machine learning datasets that are loaded and maintained by proxiML. If you plan to use one of the datasets below in your model, be sure to select it in the job form as instructed below instead of provisioning worker storage and downloading it yourself.

Image Classification

Text Processing

Object Detection/Segmentation

If you would like a public dataset added, please contact us with a link to the dataset and a brief description of what you need it for.

Using a Public Dataset

Public datasets can be used by selecting Public Dataset from the Dataset Type field in the Data section of the job form. Select the desired dataset from the list and create the job. Once the job is running you can access the dataset in the /opt/ml/input directory, or using the ML_DATA_PATH environment variable.
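As a minimal sketch, assuming a Python job script, the dataset location can be resolved from the environment variable with a fallback to the fixed path (the dataset's contents are whatever files the selected dataset contains):

```python
import os
from pathlib import Path

# The attached dataset is available at /opt/ml/input;
# ML_DATA_PATH points to the same location.
data_dir = Path(os.environ.get("ML_DATA_PATH", "/opt/ml/input"))

# List the top-level contents of the dataset.
for entry in sorted(data_dir.iterdir()):
    print(entry)
```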

Private Datasets

Private datasets enable you to load data once and reuse it on any future job, or on multiple jobs at the same time, while incurring storage charges only once based on the size of the dataset. Private datasets are also included in the 50 GB free storage allotment. Datasets are immutable to prevent unexpected data changes from impacting jobs. If you need to revise a dataset, you must create a new one and remove the old one. The maximum size of any dataset is 500 GB, but you can have an unlimited number of datasets.

Creating a Dataset

Datasets can be created from three different sources: external, notebooks, and training/inference job output.

External Source

From the Datasets section, click the Create button. Specify the name of the new dataset in the Name field and then select the Source Type of the data to populate the new dataset:

  • AWS: Select this option if the data resides on Amazon S3.
  • Azure: Select this option if the dataset data resides on Azure Blob Storage.
  • GCP: Select this option if the data resides on Google Cloud Storage.
  • Kaggle: Select this option if the data is from a Kaggle Competition or Dataset.
  • Local: Select this option if the data resides on your local computer. You will be required to connect to the dataset for this option to work. Jobs using the local storage options will wait indefinitely for you to connect.
  • Regional Datastore: Select this option to mount the dataset directly from a Regional Datastore in an existing CloudBender region.
  • Wasabi: Select this option if the dataset data resides on Wasabi Storage.
  • Web: Select this option if the data resides on a publicly accessible HTTP or FTP server.

Specify the path of the data within the selected storage type in the Path field. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be automatically extracted. If you specify a directory path (ending in /), a sync will run from that path, downloading all files and subdirectories beneath it. Valid paths for each Source Type are the following (illustrative examples appear after the list):

  • AWS: Must begin with s3://.
  • Azure: Must begin with https://.
  • GCP: Must begin with gs://.
  • Kaggle: Must be the short name of the competition or dataset, compatible with the Kaggle API.
  • Local: Must begin with / (absolute path), ~/ (home directory relative), or $ (environment variable path). Relative paths (using ./) are not supported.
  • Regional Datastore: Must begin with / (absolute path).
  • Wasabi: Must begin with s3://.
  • Web: Must begin with http://, https://, ftp://, or ftps://.
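
For illustration only, Path values for a few source types might look like the following; the bucket, account, and file names are placeholders, not real locations:

```text
AWS:   s3://my-bucket/datasets/images/          (directory, synced recursively)
GCP:   gs://my-bucket/datasets/images.tar.gz    (compressed file, auto-extracted)
Azure: https://myaccount.blob.core.windows.net/container/data/
Web:   https://example.com/files/dataset.zip
Local: ~/projects/my-dataset/
```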

Source-Specific Fields
  • Type (Kaggle Only): The type of Kaggle data you are specifying, Competition, Dataset, or Kernel (Notebook).
  • Endpoint (Wasabi Only): The service URL of the Wasabi bucket you are using.
  • Path (Regional Datastore Only): The subdirectory inside the regional datastore to load the data from. Use / to load the entire datastore.

Click Create to start populating the dataset. If you selected any option except Local, the dataset download will take place automatically and the dataset will change to a state of ready when it is complete. If you selected Local, you must connect to the dataset by selecting it and clicking the Connect button to proceed with the data population.

Note on Kaggle Data

Kaggle data requires you to specify whether it will be populated from a Kaggle competition or a Kaggle dataset. In the Type field, select Competition if you are downloading the data for a competition you have entered, or Dataset if you are downloading another public or personal dataset.

caution

You can only download competition datasets if you have already read and accepted the rules through the Kaggle website.

For the Path field, you must specify the short name Kaggle uses for the competition or dataset. The two easiest ways to find this short name are:

  1. The URL path of the competition or dataset you wish to download. For example, if you are viewing this dataset on the 2020 US Election in your web browser, the URL in your address bar is https://www.kaggle.com/unanimad/us-election-2020. To download this dataset into proxiML, specify unanimad/us-election-2020 in the Path field; specifically, the URL component after www.kaggle.com/. If you are viewing the Mechanisms of Action competition in your web browser, the URL in your address bar is https://www.kaggle.com/c/lish-moa. To download this competition's data into proxiML, specify lish-moa in the Path field; specifically, the URL component after www.kaggle.com/c/.
  2. Viewing the API command from the Kaggle web interface. For datasets, if you click the triple dot button on the far right side of the Kaggle Dataset menu bar, next to the New Notebook button, there is a Copy API command button. Clicking this for the dataset on the 2020 US Election copies kaggle datasets download -d unanimad/us-election-2020 to your clipboard. To download this dataset into proxiML, specify unanimad/us-election-2020 in the Path field; specifically, the command component after download -d. For a competition, if you click the Data tab on the Kaggle Competition menu bar, right above the Data Explorer, it will list the API command to download the data. If you are viewing the Mechanisms of Action competition, you will see kaggle competitions download -c lish-moa. To download this competition's data into proxiML, specify lish-moa in the Path field; specifically, the command after download -c.

Notebooks

To create a dataset from an existing notebook, select the notebook from the Notebook Dashboard and click Copy. The Copy button is only enabled when a single notebook is selected and that notebook is either running or stopped. Select Save to proxiML as the Copy Type. Select Dataset from the Type dropdown and enter the name for the new dataset in the New Dataset Name field. You have the option to copy either the /opt/ml/models folder or the /opt/ml/output folder. Select which folder you wish to copy from the Save Directory dropdown and click Copy to begin the copy process. You will be automatically navigated to the Datasets Dashboard, where you can monitor the progress of the dataset creation.

Training/Inference Job Output

Training or inference jobs can be configured to send their output to a proxiML dataset instead of an external source. To create a dataset from a job, select proxiML as the Output Type and dataset as the Output URI in the Data section of the job form. Once each worker in the job finishes, it will save the entire directory structure of /opt/ml/output to a new dataset named Job - <job name> if there is one worker, or Job - <job name> Worker <worker number> if there are multiple workers.
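As a minimal sketch, assuming a Python training script, anything the script writes under /opt/ml/output before the worker finishes ends up in the new dataset; the file names and metric values below are placeholders:

```python
import json
from pathlib import Path

# Everything written under /opt/ml/output is captured in the new dataset
# when the worker finishes (file names here are hypothetical).
output_dir = Path("/opt/ml/output")
output_dir.mkdir(parents=True, exist_ok=True)

metrics = {"epochs": 10, "accuracy": 0.93}  # placeholder values
(output_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
(output_dir / "predictions.csv").write_text("id,label\n1,0\n2,1\n")
```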

Using a Private Dataset

Private datasets can be used by selecting My Dataset from the Dataset Type field in the Data section of the job form. Select the desired dataset from the list and create the job. Once the job is running you can access the dataset in the /opt/ml/input directory, or using the ML_DATA_PATH environment variable.

Removing a Dataset

Datasets can only be removed once all jobs configured to use them are terminated. To remove a dataset, ensure that the Active Jobs column is zero, select the dataset, and click the Delete button. Since this action is permanent, you will be prompted to confirm prior to deleting.