This guide will help you learn how Replicate works by building a simple model.
If you prefer working in notebooks, follow our notebook tutorial on Colab.
We're going to make a model that classifies Iris plants, trained on the Iris dataset. It's an intentionally simple model that trains really fast, just so we can show you how Replicate works.
First, let's make a directory to work in:
mkdir iris-classifier
cd iris-classifier
Replicate is a Python package, and we need a few other Python packages to make the model run.
Create requirements.txt to define the Python packages to install:
replicate==0.2.0
scikit-learn==0.23.1
torch==1.4.0
Then, install them:
pip install -r requirements.txt
You might want to use a Virtualenv or Conda so these packages don't collide with others on your computer.
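For example, a typical virtualenv setup looks like this (the directory name venv is just a convention):

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt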
Copy and paste this code into train.py:
import argparse

import replicate
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
import torch
from torch import nn
from torch.autograd import Variable


def train(learning_rate, num_epochs):
    # Create an "experiment". This represents a run of your training script.
    # It saves the training code at the given path and any hyperparameters.
    experiment = replicate.init(
        path=".",
        params={"learning_rate": learning_rate, "num_epochs": num_epochs},
    )

    print("Downloading data set...")
    iris = load_iris()
    train_features, val_features, train_labels, val_labels = train_test_split(
        iris.data,
        iris.target,
        train_size=0.8,
        test_size=0.2,
        random_state=0,
        stratify=iris.target,
    )
    train_features = torch.FloatTensor(train_features)
    val_features = torch.FloatTensor(val_features)
    train_labels = torch.LongTensor(train_labels)
    val_labels = torch.LongTensor(val_labels)

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(4, 15), nn.ReLU(), nn.Linear(15, 3))
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        model.train()
        optimizer.zero_grad()
        outputs = model(train_features)
        loss = criterion(outputs, train_labels)
        loss.backward()
        optimizer.step()

        with torch.no_grad():
            model.eval()
            output = model(val_features)
            acc = (output.argmax(1) == val_labels).float().sum() / len(val_labels)

        print(
            "Epoch {}, train loss: {:.3f}, validation accuracy: {:.3f}".format(
                epoch, loss.item(), acc
            )
        )
        torch.save(model, "model.pth")

        # Create a checkpoint within the experiment.
        # This saves the metrics at that point, and makes a copy of the file
        # or directory given, which could include weights and any other artifacts.
        experiment.checkpoint(
            path="model.pth",
            step=epoch,
            metrics={"loss": loss.item(), "accuracy": acc},
            primary_metric=("loss", "minimize"),
        )


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--learning_rate", type=float, default=0.01)
    parser.add_argument("--num_epochs", type=int, default=100)
    args = parser.parse_args()
    train(args.learning_rate, args.num_epochs)
Notice there are two lines in this script that call Replicate. They don't affect the behavior of the training – they just save data in Replicate to keep track of what is going on.
The first is replicate.init(). This creates an experiment, which represents a run of your training script. The experiment records the hyperparameters you pass to it and makes a copy of the given path to save your training code.
The second is experiment.checkpoint(). This creates a checkpoint within the experiment. The checkpoint saves the metrics at that point, and makes a copy of the file or directory you pass to it, which could include weights and any other artifacts.
Each experiment contains multiple checkpoints. You typically save your model periodically during training, because the best result isn't necessarily the most recent one. A checkpoint is created just after you save your model, so Replicate can keep track of versions of your saved model.
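To make that pattern concrete, here is a minimal, self-contained sketch of the save-then-checkpoint loop. The model and the "training" are trivial stand-ins for illustration, not the tutorial's model, and like train.py it needs the replicate.yaml described below before it will run:

import replicate
import torch
from torch import nn

# A stand-in model; the point here is the Replicate calls, not the training.
model = nn.Linear(4, 3)

# One experiment per run of the script.
experiment = replicate.init(
    path=".",
    params={"learning_rate": 0.01, "num_epochs": 3},
)

for epoch in range(3):
    loss = 1.0 / (epoch + 1)  # stand-in for a real training loss
    torch.save(model, "model.pth")  # save the model first...
    experiment.checkpoint(          # ...then record that version of it
        path="model.pth",
        step=epoch,
        metrics={"loss": loss},
        primary_metric=("loss", "minimize"),
    )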
We need to tell Replicate where to store your experiments. Create replicate.yaml with this content:
repository: "file://.replicate"
This will store your experiments in the .replicate directory relative to this file. You can also store them on Amazon S3 or Google Cloud Storage if you want to keep your data in the cloud. Learn more about this in our guide to using cloud storage.
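For example, to store experiments in an S3 bucket instead, you would point the repository at the bucket. The bucket name here is a placeholder for one you've created yourself:

repository: "s3://my-replicate-bucket"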
We're now going to train this model a couple of times with different parameters to see what we can do with Replicate.
First, train it with default parameters:
$ python train.py
Epoch 0, train loss: 1.184, validation accuracy: 0.333
Epoch 1, train loss: 1.117, validation accuracy: 0.333
Epoch 2, train loss: 1.061, validation accuracy: 0.467
...
Epoch 97, train loss: 0.121, validation accuracy: 1.000
Epoch 98, train loss: 0.119, validation accuracy: 1.000
Epoch 99, train loss: 0.118, validation accuracy: 1.000
Next, run the training with a different learning rate:
$ python train.py --learning_rate=0.2
Epoch 0, train loss: 1.184, validation accuracy: 0.333
Epoch 1, train loss: 1.161, validation accuracy: 0.633
Epoch 2, train loss: 1.124, validation accuracy: 0.667
...
Epoch 97, train loss: 0.057, validation accuracy: 0.967
Epoch 98, train loss: 0.057, validation accuracy: 0.967
Epoch 99, train loss: 0.056, validation accuracy: 0.967
The calls to the replicate Python library have saved your experiments locally. You can use replicate ls to list them:
$ replicate ls
EXPERIMENT  STARTED         STATUS   LEARNING_RATE  LATEST CHECKPOINT  LOSS      BEST CHECKPOINT    LOSS
b90ad56     12 seconds ago  stopped  0.01           4941495 (step 99)  0.1176    4941495 (step 99)  0.1176
9cce006     3 seconds ago   stopped  0.2            a122e85 (step 99)  0.056486  a122e85 (step 99)  0.056486
The --filter flag allows you to narrow in on a subset of experiments:
$ replicate ls --filter "learning_rate = 0.2"
EXPERIMENT  STARTED        STATUS   LEARNING_RATE  LATEST CHECKPOINT  LOSS      BEST CHECKPOINT    LOSS
9cce006     3 seconds ago  stopped  0.2            a122e85 (step 99)  0.056486  a122e85 (step 99)  0.056486
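Filters should also work on metrics and with comparison operators, so something like the following ought to show only experiments whose best loss dropped below 0.1 (the threshold is arbitrary; check replicate ls --help for the exact syntax your version supports):

$ replicate ls --filter "loss < 0.1"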
As a reminder, this is a list of experiments, each representing a run of the train.py script. Experiments store a copy of the code as it was when the script was started.
Within experiments are checkpoints, which are created every time you call experiment.checkpoint() in your training script. A checkpoint contains your weights, TensorBoard logs, and any other artifacts you want to save.
To list the checkpoints within an experiment, you can use replicate show. Run this, replacing b90ad56 with an experiment ID from your output of replicate ls:
$ replicate show b90ad56
Experiment: b90ad56a755371548ae2ab98c9d40a85911fd6198254880e600cdf00f55a18ca

Created:  Wed, 02 Sep 2020 20:44:51 PDT
Status:   stopped
Host:     107.133.144.125
User:     ben
Command:  train.py

Params
learning_rate:  0.01
num_epochs:     100

Checkpoints
ID       STEP  CREATED        ACCURACY  LOSS
9ed04a2  0     5 minutes ago  0.33333   1.1836
b37e01d  1     5 minutes ago  0.33333   1.1173
c74e9c6  2     5 minutes ago  0.46667   1.0611
7ba5b47  3     5 minutes ago  0.63333   1.0138
886f612  4     5 minutes ago  0.7       0.97689
667fdba  5     5 minutes ago  0.9       0.9496
...
cd1223c  95    5 minutes ago  1         0.12417
510eb98  96    5 minutes ago  1         0.12244
59129de  97    5 minutes ago  1         0.12076
e301a55  98    5 minutes ago  1         0.11915
4941495  99    5 minutes ago  1         0.1176 (best)
You can also use replicate show on a checkpoint to get all the information about it. Run this, replacing 494 with a checkpoint ID from the experiment:
$ replicate show 494
Checkpoint: 49414952394edfdf7923edd6bfb4aabe5558a6276a02a71a5965e1622ee7b9fd

Created:  Wed, 02 Sep 2020 20:44:52 PDT
Path:     model.pth
Step:     99

Experiment
ID:       b90ad56a755371548ae2ab98c9d40a85911fd6198254880e600cdf00f55a18ca
Created:  Wed, 02 Sep 2020 20:44:51 PDT
Status:   stopped
Host:     107.133.144.125
User:     ben
Command:  train.py

Params
learning_rate:  0.01
num_epochs:     100

Metrics
accuracy:  1
loss:      0.11759971082210541 (primary, minimize)
Notice you can pass a prefix to replicate show, and it'll automatically find the experiment or checkpoint that starts with just those characters. Saves a few keystrokes.
Let's compare the last checkpoints from the two experiments we ran. Run this, replacing 4941495 and a122e85 with the two checkpoint IDs from the LATEST CHECKPOINT column in your output of replicate ls:
$ replicate diff 4941495 a122e85
Experiment
ID:       b90ad56                        9cce006
Command:  train.py                       train.py --learning_rate=0.2
Created:  Wed, 02 Sep 2020 20:44:51 PDT  Wed, 02 Sep 2020 20:45:01 PDT

Params
learning_rate:  0.01  0.2

Checkpoint
ID:       4941495                        a122e85
Created:  Wed, 02 Sep 2020 20:44:55 PDT  Wed, 02 Sep 2020 20:45:04 PDT

Metrics
accuracy:  1                    0.9666666388511658
loss:      0.11759971082210541  0.056485891342163086
replicate diff works a bit like git diff, except in addition to the code, it compares all of the metadata that Replicate is aware of: params, metrics, dependencies, and so on.

replicate diff compares checkpoints, because checkpoints are the thing that actually has all the results. You can also pass an experiment ID, and it will pick the best or latest checkpoint from that experiment.
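For example, using the two experiment IDs from the listing above, this should produce the same comparison without you having to look up checkpoint IDs first:

$ replicate diff b90ad56 9cce006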
At some point you might want to get back to a previous point in your work. Maybe you've run a bunch of experiments in parallel, and you want to pick the one that works best. Or, perhaps you've gone down a line of exploration and it's not working, so you want to get back to where you were a week ago.
The replicate checkout command will copy the code and weights from a checkpoint into your working directory. Run this, replacing 4941495 with a checkpoint ID you passed to replicate diff:
$ replicate checkout 4941495
═══╡ The directory "/Users/ben/p/tmp/iris-classifier" is not empty.
═══╡ This checkout may overwrite existing files. Make sure you've committed everything to Git so it's safe!
Do you want to continue? (y/N) y
═══╡ Checked out 4941495 to "/Users/ben/p/tmp/iris-classifier"
The model file in your working directory is now the model saved in that checkpoint:
$ ls -lh model.pth
-rw-r--r--  1 ben  staff  8.3K Aug  7 16:42 model.pth
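As a quick sanity check, here is a minimal sketch of loading the checked-out model and classifying a single flower. The measurements below are an arbitrary example, not part of the tutorial:

import torch
from sklearn.datasets import load_iris

# train.py saved the whole model with torch.save, so torch.load
# restores it directly.
model = torch.load("model.pth")
model.eval()

# Arbitrary example: sepal length/width and petal length/width, in cm.
features = torch.FloatTensor([[5.1, 3.5, 1.4, 0.2]])

with torch.no_grad():
    prediction = model(features).argmax(1).item()

print(load_iris().target_names[prediction])  # e.g. "setosa"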
This is useful for getting the trained model out of a checkpoint, but it also copies all of the code from the experiment that checkpoint is part of. If you made a change to the code and didn't commit it to Git, replicate checkout will allow you to get back the exact code from an experiment.
This means you don't have to remember to commit to Git when you're running experiments. Just try a bunch of things, then when you've found something that works, use Replicate to get back to the exact code that produced those results and formally commit it to Git.
Neat, huh? Replicate is keeping track of everything in the background so you don't have to.
With these tools, let's recap what the workflow looks like:
1. Add experiment = replicate.init() and experiment.checkpoint() to your training code.
2. List and inspect your experiments with replicate ls and replicate show.
3. Compare their results with replicate diff.
4. Get back to the code and weights from any previous point with replicate checkout.

You don't have to keep track of what you changed in your experiments, because Replicate does that automatically for you. You can also safely change things without committing to Git, because replicate checkout will always be able to get you back to the exact environment the experiment was run in.
If something doesn't make sense, doesn't work, or you just have some questions, please email us: team@replicate.ai. We love hearing from you!