## Requirements

1. `git clone https://github.com/pytorch/elastic.git`
2. `cd elastic/aws && pip install -r requirements.txt`
3. The following AWS resources:
   1. EC2 [instance profile](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2_instance-profiles.html)
   2. [Subnet(s)](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html#create-default-subnet)
   3. [Security group](https://docs.aws.amazon.com/vpc/latest/userguide/VPC_SecurityGroups.html#DefaultSecurityGroup)
   4. EFS volume
   5. S3 Bucket
4. [Install](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager-working-with-install-plugin.html) the AWS Session Manager plugin
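
You can sanity-check these prerequisites from the command line. The sketch below is an assumption, not part of the original guide: it uses the stock AWS CLI, and every resource id and name in it is a placeholder to substitute with your own.

```bash
# steps 1-2: fetch the sources and install the python dependencies
git clone https://github.com/pytorch/elastic.git
cd elastic/aws && pip install -r requirements.txt

# step 3: confirm the AWS resources exist (all ids/names are placeholders)
aws iam get-instance-profile --instance-profile-name my-torchelastic-profile
aws ec2 describe-subnets --subnet-ids subnet-0123456789abcdef0
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0
aws efs describe-file-systems --file-system-id fs-01234567
aws s3 ls s3://my-torchelastic-bucket

# step 4: a successful install prints the plugin's banner
session-manager-plugin
```
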
## Quickstart

…you have downloaded the imagenet dataset to `/mnt/efs/fs1/data/imagenet/train`.

To run the script we'll use `petctl`:

```bash
python3 aws/petctl.py run_job --size 2 --min_size 1 --max_size 3 --name ${USER}-job examples/imagenet/main.py -- --input_path /mnt/efs/fs1/data/imagenet/train
```

In the example above, the named arguments, such as `--size`, `--min_size`, and …
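
As a rough illustration (the sizes below are arbitrary, and this reading of the flags is an assumption rather than a quote from the guide): the job starts out with `--size` workers and keeps running as long as the group stays between `--min_size` and `--max_size`, while everything after the bare `--` is forwarded to the training script instead of being parsed by `petctl`.

```bash
# hypothetical variant: start with 4 workers, tolerate anywhere from 2 to 8;
# --input_path is consumed by examples/imagenet/main.py, not by petctl
python3 aws/petctl.py run_job \
  --size 4 --min_size 2 --max_size 8 \
  --name ${USER}-job \
  examples/imagenet/main.py -- --input_path /mnt/efs/fs1/data/imagenet/train
```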

You can take a look at their console outputs by running:

```bash
# see the status of the worker
sudo systemctl status torchelastic_worker

# get the container id
sudo docker ps

# tail the container logs
sudo docker logs -f <container id>
```

> Note: since we have configured the log driver to be `awslogs`, tailing
> the docker logs will not work. For more information see: https://docs.docker.com/config/containers/logging/awslogs/
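
One way to follow them instead is through CloudWatch itself. This sketch assumes AWS CLI v2 (`aws logs tail` does not exist in v1) and a placeholder log group name; look up the real group in the CloudWatch console.

```bash
# follow the worker logs in CloudWatch (log group name is a placeholder)
aws logs tail torchelastic/${USER}-job --follow
```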

You can also manually stop and start the workers by running:

```bash
sudo systemctl stop torchelastic_worker
sudo systemctl start torchelastic_worker
```

> **EXERCISE:** Try stopping or adding worker(s) to see elasticity in action!
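
To try it, here is a hedged sketch of both operations; the auto scaling group name is a placeholder (look up the actual name `run_job` created in the EC2 Auto Scaling console), and the desired capacity should stay within the `--min_size`/`--max_size` window passed to `run_job`.

```bash
# remove a worker: stop the service on one of the instances
sudo systemctl stop torchelastic_worker

# add workers: grow the worker auto scaling group (name is a placeholder)
aws autoscaling set-desired-capacity \
  --auto-scaling-group-name ${USER}-job-worker \
  --desired-capacity 3
```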

…that is monitoring the job!). In practice, consider using EKS, Batch, or SageMaker.

To stop the job and tear down the resources, use the `kill_job` command:

```bash
python3 petctl.py kill_job ${USER}-job
```

You'll notice that the two ASGs created with the `run_job` command are deleted.
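
To double-check the teardown, a small sketch (assumes the AWS CLI; the name filter is a guess based on the job name):

```bash
# list any auto scaling groups still referencing the job name;
# an empty list means kill_job tore everything down
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[?contains(AutoScalingGroupName, '${USER}-job')].AutoScalingGroupName"
```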