trainML Training Jobs allow you to effortlessly run parallel model training experiments across dozens of GPUs.
GPU-enabled servers tend to suffer from the twin problems of utilization and contention. Because they are so expensive, companies want to ensure their servers are used as heavily as possible; every minute a GPU sits idle is wasted investment. On the flip side, once a server is fully utilized, processes must be put in place to schedule training experiments among users, which frequently wastes even more expensive data scientist time.
GPU-enabled cloud instances seek to ameliorate this problem, but come with their own trade-offs. They are dramatically more expensive than normal CPU instances. Instance setup (loading model code and datasets) and teardown (copying artifacts and other data to permanent storage) can consume a significant amount of time (and cost). Forgetting to stop the instances after training completes can be a very costly mistake.
trainML Training Jobs offer the best of both worlds. All you have to do is provide the git URL for the model code, create or select a dataset to train on, specify where the model artifacts should be sent, and supply the command that runs the training script. trainML automatically downloads the required inputs, runs your training script, uploads the results, and terminates itself when it's done. With our on-demand cloud infrastructure, you never have to worry about utilization or contention.
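To make this concrete, here is a minimal sketch of creating such a job with the trainML Python SDK. The dataset name, git URL, output bucket, GPU type, and the exact keyword arguments are illustrative assumptions rather than a definitive reference, so check the SDK documentation for the current parameter names.

```python
import asyncio
from trainml.trainml import TrainML


async def main():
    trainml = TrainML()  # reads API credentials from your environment/config

    # Create a training job: model code from git, an existing trainML dataset,
    # results uploaded to cloud storage, and the command each worker runs.
    job = await trainml.jobs.create(
        name="example-training-job",
        type="training",
        gpu_type="RTX 2080 Ti",          # placeholder GPU type
        gpu_count=1,
        disk_size=10,                    # GB of scratch disk per worker
        workers=["python train.py --epochs 10"],
        data=dict(
            datasets=[dict(name="my-dataset", type="existing")],
            output_type="aws",
            output_uri="s3://example-bucket/training-output",
        ),
        model=dict(source_type="git", source_uri="git@github.com:example/model.git"),
    )
    # The job downloads its inputs, runs the command, uploads the results,
    # and terminates itself when training completes.


asyncio.run(main())
```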
Frequently, customers want to compare the results of different model architectures or different sets of hyperparameters on the same datasets. Using a single server, you would have to wait until each training run finished before beginning the next experiment, wasting valuable time. If you run each experiment on a separate server, you incur the setup and teardown costs that many more times.
With trainML Training Jobs, scaling an experiment to run in parallel is as simple as adding more workers to the training job. You can specify a command with different command-line arguments for each worker while reusing the same model and dataset, as in the sketch below. If you're using an optimization library like hyperopt, you can use the same command for each worker and let the framework do the work. You can find an example of this here.
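As a rough illustration of a parallel hyperparameter sweep, the sketch below reuses the same placeholder model and dataset as above and only varies the command-line arguments between workers; the parameter names and values are assumptions for illustration:

```python
import asyncio
from trainml.trainml import TrainML


async def main():
    trainml = TrainML()

    # Same model code and dataset for every worker; only the learning rate
    # passed on the command line changes between experiments.
    learning_rates = [0.1, 0.01, 0.001, 0.0001]
    workers = [f"python train.py --learning-rate {lr}" for lr in learning_rates]

    await trainml.jobs.create(
        name="learning-rate-sweep",
        type="training",
        gpu_type="RTX 2080 Ti",
        gpu_count=1,                     # GPUs per worker
        disk_size=10,
        workers=workers,                 # four workers run in parallel
        data=dict(
            datasets=[dict(name="my-dataset", type="existing")],
            output_type="aws",
            output_uri="s3://example-bucket/sweep-output",
        ),
        model=dict(source_type="git", source_uri="git@github.com:example/model.git"),
    )


asyncio.run(main())
```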
A significant challenge with running multiple training experiments in parallel is monitoring their progress. The trainML platform captures and centralizes the log output of all workers running an experiment. You can view output in real-time for all workers simultaneously, or filter down to watch each worker individually. By monitoring their progress, you can easily detect if you need to stop a worker early and save your money for more productive experiments.
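Assuming the SDK's job object exposes an attach call for log streaming (a continuation of the sketches above, so it belongs inside the same async function), following all workers at once from a terminal could look like:

```python
# Stream the log output of all workers to the console as it is produced;
# returns once the job reaches a terminal state.
await job.attach()
```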
trainML Training Jobs can automatically upload their results to cloud storage providers for permanent storage. However, customers usually prefer to analyze the results of a training job in their local environment first. In that case, you can use the local storage option to have the training job's workers upload their results directly to your local computer. This advanced function requires the trainML local connection capability, so ensure you satisfy the prerequisites before attempting to use the local data option.
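A sketch of the local-output variant, under the same assumptions as the earlier examples (placeholder dataset, git URL, and output path), might look like the following; the local connection has to stay open while the workers upload their results:

```python
import asyncio
from trainml.trainml import TrainML


async def main():
    trainml = TrainML()

    job = await trainml.jobs.create(
        name="local-output-example",
        type="training",
        gpu_type="RTX 2080 Ti",
        gpu_count=1,
        disk_size=10,
        workers=["python train.py --epochs 10"],
        data=dict(
            datasets=[dict(name="my-dataset", type="existing")],
            output_type="local",             # send results to this computer
            output_uri="~/training-output",  # placeholder local path
        ),
        model=dict(source_type="git", source_uri="git@github.com:example/model.git"),
    )

    # The local connection must be running for the workers to upload their
    # results directly to the output_uri path on this machine.
    await job.connect()
    await job.attach()      # stream logs until the job finishes
    await job.disconnect()


asyncio.run(main())
```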
Read about the process for starting and monitoring a trainML Training Job, or get started creating a trainML Training Job right away. A simple image classification tutorial is available to help you familiarize yourself with training jobs, and you can find out more about instance and storage billing and credits.