Datasets
Datasets are a great option for reducing the storage requirements of jobs as well as reusing data across many jobs. Public datasets are completely free to use, and private datasets only incur storage charges for their own size, no matter how many jobs they are used on.
Public Datasets
Public datasets are a collection of popular public domain machine learning datasets that are loaded and maintained by proxiML. If you are planning to use one of the below datasets in your model, be sure to select it in the job form as instructed below instead of provisioning worker storage and downloading it yourself.
Image Classification
- CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html
- CIFAR-100: https://www.cs.toronto.edu/~kriz/cifar.html
- ImageNet: http://www.image-net.org/
Text Processing
- MultiNLI: https://cims.nyu.edu/~sbowman/multinli/
- SNLI: https://nlp.stanford.edu/projects/snli/
- WikiText-103: https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/
Object Detection/Segmentation
- COCO: https://cocodataset.org/#home
- PASCAL VOC: http://host.robots.ox.ac.uk/pascal/VOC/
- CamVid: https://www.kaggle.com/datasets/carlolepelaars/camvid
- Medical Segmentation Decathlon: http://medicaldecathlon.com
If you would like a public dataset added, please contact us with a link to the dataset and a brief description of what you need it for.
Using a Public Dataset
Public datasets can be used by selecting `Public Dataset` from the `Dataset Type` field in the `Data` section of the job form. Select the desired dataset from the list and create the job. Once the job is running, you can access the dataset in the `/opt/ml/input` directory, or using the `ML_DATA_PATH` environment variable.
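For example, here is a minimal sketch of how a job's Python code might locate and list the mounted dataset files (the exact file layout depends on the dataset you selected):

```python
import os

# The dataset is mounted at /opt/ml/input; ML_DATA_PATH points to the same location.
data_dir = os.environ.get("ML_DATA_PATH", "/opt/ml/input")

# Walk the dataset directory to see what the selected public dataset provides.
for root, _, files in os.walk(data_dir):
    for name in files:
        print(os.path.join(root, name))
```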
Private Datasets
Private datasets enable you to load a dataset once and reuse it on any future job, or on multiple jobs at the same time, while only incurring storage charges once based on the size of the dataset. Private datasets are also included in the 50 GB free storage allotment. Datasets are immutable to prevent unexpected data changes from impacting jobs. If you need to revise a dataset, you must create a new one and remove the old one. The maximum size of any dataset is 500 GB, but you can have unlimited datasets.
Creating a Dataset
Datasets can be created from three different sources: external, notebooks, and training/inference job output.
External Source
From the `Datasets` section, click the `Create` button. Specify the name of the new dataset in the `Name` field and then select the `Source Type` of the data to populate the new dataset:

- `AWS`: Select this option if the data resides on Amazon S3.
- `Azure`: Select this option if the dataset data resides on Azure Blob Storage.
- `GCP`: Select this option if the data resides on Google Cloud Storage.
- `Kaggle`: Select this option if the data is from a Kaggle Competition or Dataset.
- `Local`: Select this option if the data resides on your local computer. You will be required to connect to the dataset for this option to work. Jobs using the local storage option will wait indefinitely for you to connect.
- `Regional Datastore`: Select this option to mount the dataset directly from a Regional Datastore in an existing CloudBender region.
- `Wasabi`: Select this option if the dataset data resides on Wasabi Storage.
- `Web`: Select this option if the data resides on a publicly accessible HTTP or FTP server.
Specify the path of the data within the storage type specified in the `Path` field. If you specify a compressed file (zip, tar, tar.gz, or bz2), the file will be automatically extracted. If you specify a directory path (ending in `/`), it will run a sync starting from the path provided, downloading all files and subdirectories from the provided path. Valid paths for each `Source Type` are the following:

- `AWS`: Must begin with `s3://`.
- `Azure`: Must begin with `https://`.
- `GCP`: Must begin with `gs://`.
- `Kaggle`: Must be the short name of the competition or dataset, compatible with the Kaggle API.
- `Local`: Must begin with `/` (absolute path), `~/` (home directory relative), or `$` (environment variable path). Relative paths (using `./`) are not supported.
- `Regional Datastore`: Must begin with `/` (absolute path).
- `Wasabi`: Must begin with `s3://`.
- `Web`: Must begin with `http://`, `https://`, `ftp://`, or `ftps://`.
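As an illustration, the following hypothetical `Path` values satisfy the rules above (all bucket, account, and directory names are placeholders):

```python
# Hypothetical example Path values per Source Type (names are placeholders).
example_paths = {
    "AWS": "s3://my-bucket/training-data/",  # directory, synced recursively
    "Azure": "https://myaccount.blob.core.windows.net/container/images.tar.gz",
    "GCP": "gs://my-bucket/dataset.zip",  # archive, extracted automatically
    "Kaggle": "unanimad/us-election-2020",  # short name of a Kaggle dataset
    "Local": "~/datasets/images/",
    "Regional Datastore": "/shared/training-data/",
    "Wasabi": "s3://my-wasabi-bucket/data/",
    "Web": "https://example.com/archives/corpus.tar.gz",
}
```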
Source Specific Fields

- `Type` (Kaggle only): The type of Kaggle data you are specifying: Competition, Dataset, or Kernel (Notebook).
- `Endpoint` (Wasabi only): The service URL of the Wasabi bucket you are using.
- `Path` (Regional Datastore only): The subdirectory inside the regional datastore to load the data from. Use `/` to load the entire datastore.
Click `Create` to start populating the dataset. If you selected any option except `Local`, the dataset download will take place automatically and the dataset will change to a state of `ready` when it is complete. If you selected `Local`, you must connect to the dataset by selecting the dataset and clicking the `Connect` button to proceed with the data population.
Note on Kaggle Data
Kaggle datasets require you to specify if the data will be populated from a Kaggle competition or a dataset. In the `Type` field, select `Competition` if you are downloading the data for a competition you have entered, or `Dataset` if you are downloading another public or personal dataset.
You can only download competition datasets if you have already read and accepted the rules through the Kaggle website.
For the `Path` field, you must specify the short name Kaggle uses for the competition or the dataset. The two easiest ways to find this short name are:
- The URL path of the competition or dataset you wish to download. For example, if you are viewing this dataset on the 2020 US Election in your web browser, the URL in your address bar is `https://www.kaggle.com/unanimad/us-election-2020`. If you want to download this dataset into proxiML, specify `unanimad/us-election-2020` in the `Path` field, specifically, the URL component after `www.kaggle.com/`. If you are viewing the Mechanisms of Action competition in your web browser, the URL in your address bar is `https://www.kaggle.com/c/lish-moa`. If you want to download this competition's data into proxiML, specify `lish-moa` in the `Path` field, specifically, the URL component after `www.kaggle.com/c/`.
- Viewing the API command from the Kaggle web interface. For datasets, if you click the triple dot button on the far right side of the Kaggle Dataset menu bar, next to the `New Notebook` button, there is a `Copy API command` button. If you click this for the dataset on the 2020 US Election, it will copy `kaggle datasets download -d unanimad/us-election-2020` into your clipboard. If you want to download this dataset into proxiML, specify `unanimad/us-election-2020` in the `Path` field, specifically, the command component after `download -d`. For a competition, if you click the `Data` tab on the Kaggle Competition menu bar, right above the `Data Explorer`, it will list the API command to download the datasets. If you are viewing the Mechanisms of Action competition, you will see `kaggle competitions download -c lish-moa`. If you want to download this competition's data into proxiML, specify `lish-moa` in the `Path` field, specifically, the command component after `download -c`.
Notebooks
To create a dataset from an existing notebook, select the notebook from the Notebook Dashboard and click `Copy`. The `Copy` button is only enabled when a single notebook is selected and that notebook is either `running` or `stopped`. Select `Save to proxiML` as the `Copy Type`. Select `Dataset` from the `Type` dropdown and enter the name for the new dataset in the `New Dataset Name` field. You have the option to copy either the `/opt/ml/models` folder or the `/opt/ml/output` folder. Select which folder you wish to copy from the `Save Directory` dropdown and click `Copy` to begin the copy process. You will be automatically navigated to the Datasets Dashboard, where you can monitor the progress of the dataset creation.
Training/Inference Job Output
Training or inference jobs can be configured to send their output to a proxiML dataset instead of an external source. To create a dataset from a job, select `proxiML` as the `Output Type` and `dataset` as the `Output URI` in the data section of the job form. Once each worker in the job finishes, it will save the entire directory structure of `/opt/ml/output` to a new dataset named `Job - <job name>` if there is one worker, or `Job - <job name> Worker <worker number>` if there are multiple workers.
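For example, here is a minimal sketch of a training script that writes its artifacts under `/opt/ml/output` so they are captured in the automatically created dataset (the file names and metric values are hypothetical):

```python
import json
import os

# Anything written under /opt/ml/output is saved to the "Job - <job name>" dataset
# when this worker finishes (one dataset per worker if there are multiple workers).
output_dir = "/opt/ml/output"
os.makedirs(output_dir, exist_ok=True)

# Hypothetical artifacts produced by the job.
with open(os.path.join(output_dir, "metrics.json"), "w") as f:
    json.dump({"loss": 0.123, "epochs": 10}, f)
```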
Using a Private Dataset
Private datasets can be used by selecting `My Dataset` from the `Dataset Type` field in the `Data` section of the job form. Select the desired dataset from the list and create the job. Once the job is running, you can access the dataset in the `/opt/ml/input` directory, or using the `ML_DATA_PATH` environment variable.
Removing a Dataset
Datasets can only be removed once all jobs that are configured to use them are terminated. To remove a dataset, ensure that the `Active Jobs` column is zero, select the dataset, and click the `Delete` button. Since this action is permanent, you will be prompted to confirm prior to deleting.