How to make Terraform wait for cloud-init to finish on EC2 without SSH

Terraform is a powerful tool, but it has no built-in way to wait for EC2 instances to be ready, rather than merely created. We will see how to use AWS SSM to do just that.


Terraform logo, courtesy of HashiCorp.

I find using SSH in Terraform quite problematic: you need to distribute a private SSH key to anybody who will launch the Terraform script, including your CI/CD system. This is a no-go for me: it adds the complexity of managing SSH keys, including their rotation. There is a huge issue on the Terraform repo on GitHub about this functionality, and the most upvoted solution is indeed connecting via SSH to run a check:

provisioner "remote-exec" {
  inline = [
    "cloud-init status --wait"
  ]
}

AWS Systems Manager Run Command

The idea of using cloud-init status --wait is indeed quite good. The only problem is how to ask Terraform to run such a command. Luckily for us, AWS has a service, AWS SSM Run Command, that allows us to run commands on an EC2 instance through AWS APIs! This way, our CI/CD system needs only an appropriate IAM role and a way to invoke AWS APIs. I use the AWS CLI in the examples below, but you can adapt them to any language you prefer.
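To get a feel for Run Command before involving Terraform, the check can be tried by hand. The following is only a sketch: it assumes the AWS-managed AWS-RunShellScript document is acceptable for a one-off test, and the function name send_cloud_init_wait is made up for illustration.

```shell
# Hypothetical helper: run "cloud-init status --wait" on one instance using
# the AWS-managed AWS-RunShellScript document, and print the command ID.
# Usage: send_cloud_init_wait INSTANCE_ID
send_cloud_init_wait() {
  local instance_id="$1"

  aws ssm send-command \
    --document-name "AWS-RunShellScript" \
    --instance-ids "$instance_id" \
    --parameters '{"commands":["cloud-init status --wait"]}' \
    --query "Command.CommandId" \
    --output text
}
```

The printed command ID is what the later get-command-invocation and wait calls expect, so it is worth capturing in a variable.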


There are some prerequisites to use AWS SSM Run Command: we need to have AWS SSM Agent installed on our instance. It is preinstalled on Amazon Linux 2 and Ubuntu 16.04, 18.04, and 20.04. For any other OS, we need to install it manually: it is supported on Linux, macOS, and Windows.
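Having the agent installed is not quite the same as having it registered: a freshly booted instance can take a little while to show up in SSM, and the agent also needs permission to talk to the service (usually through an instance profile). A possible way to check, sketched here with a hypothetical wait_for_ssm_agent function and arbitrary default timings, is to poll describe-instance-information until the instance reports Online:

```shell
# Hypothetical helper: wait until an instance is registered with SSM and
# reports a PingStatus of "Online".
# Usage: wait_for_ssm_agent INSTANCE_ID [MAX_ATTEMPTS] [DELAY_SECONDS]
wait_for_ssm_agent() {
  local instance_id="$1"
  local max_attempts="${2:-30}"
  local delay="${3:-10}"
  local attempt=1
  local status

  while [ "$attempt" -le "$max_attempts" ]; do
    # While the instance is still unknown to SSM the list is empty, so the
    # query prints "None"; a failed call is treated the same way.
    status=$(aws ssm describe-instance-information \
      --filters "Key=InstanceIds,Values=${instance_id}" \
      --query "InstanceInformationList[0].PingStatus" \
      --output text 2>/dev/null || true)

    if [ "$status" = "Online" ]; then
      return 0
    fi

    sleep "$delay"
    attempt=$((attempt + 1))
  done

  echo "Instance ${instance_id} did not register with SSM in time" >&2
  return 1
}
```

Calling this before send-command should avoid sending a command to an instance that SSM does not know about yet.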

The user or role that executes the Terraform code needs to be able to create, update, and read AWS SSM Documents, run SSM commands, and read the result of those commands. A possible policy could look like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Stmt1629387563127",
      "Action": [
        "ssm:CreateDocument",
        "ssm:DeleteDocument",
        "ssm:DescribeDocument",
        "ssm:DescribeDocumentParameters",
        "ssm:DescribeDocumentPermission",
        "ssm:GetCommandInvocation",
        "ssm:GetDocument",
        "ssm:ListDocuments",
        "ssm:SendCommand",
        "ssm:UpdateDocument",
        "ssm:UpdateDocumentDefaultVersion",
        "ssm:UpdateDocumentMetadata"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}

If we already know the names of the documents, or the instances where we want to run the commands, it is better to lock down the policy by specifying the resources, according to the principle of least privilege.
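As an illustration of what locking down could look like, the sketch below prints a policy that allows ssm:SendCommand only for one document and for the instances of one region and account. The function name and its three arguments are assumptions made for the example, and the document-management actions would still need their own statement.

```shell
# Hypothetical helper: print a policy that allows ssm:SendCommand only for
# one SSM document and for EC2 instances in one region/account.
# Usage: scoped_send_command_policy REGION ACCOUNT_ID DOCUMENT_NAME
scoped_send_command_policy() {
  local region="$1" account_id="$2" document_name="$3"

  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ssm:SendCommand",
      "Resource": [
        "arn:aws:ssm:${region}:${account_id}:document/${document_name}",
        "arn:aws:ec2:${region}:${account_id}:instance/*"
      ]
    }
  ]
}
EOF
}
```

The output can be redirected to a file and passed to aws iam create-policy with --policy-document file://policy.json. Replacing instance/* with specific instance IDs would narrow it further.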

Last but not least, we need to have the AWS CLI installed on the system that will execute Terraform.

The Terraform code

With the prerequisites in place, we need two different Terraform resources. The first creates the AWS SSM Document with the command we want to execute on the instance. The second executes that command while provisioning the EC2 instance.

The AWS SSM Document code will look like this:

resource "aws_ssm_document" "cloud_init_wait" {
  name            = "cloud-init-wait"
  document_type   = "Command"
  document_format = "YAML"
  content         = <<-DOC
    schemaVersion: '2.2'
    description: Wait for cloud init to finish
    mainSteps:
    - action: aws:runShellScript
      name: StopOnLinux
      precondition:
        StringEquals:
        - platformType
        - Linux
      inputs:
        runCommand:
        - cloud-init status --wait
    DOC
}

We can reference this document from within our EC2 instance resource, with a local provisioner:

resource "aws_instance" "example" {
  ami           = "my-ami"
  instance_type = "t3.micro"

  provisioner "local-exec" {
    interpreter = ["/bin/bash", "-c"]

    command = <<-EOF
    set -Ee -o pipefail
    export AWS_DEFAULT_REGION=${}

    # Send the command to the instance and keep its ID
    command_id=$(aws ssm send-command --document-name ${aws_ssm_document.cloud_init_wait.arn} --instance-ids ${} --output text --query "Command.CommandId")
    # Wait for the command to finish; on failure, print its output and abort
    if ! aws ssm wait command-executed --command-id $command_id --instance-id ${}; then
      echo "Failed to start services on instance ${}!";
      echo "stdout:";
      aws ssm get-command-invocation --command-id $command_id --instance-id ${} --query StandardOutputContent;
      echo "stderr:";
      aws ssm get-command-invocation --command-id $command_id --instance-id ${} --query StandardErrorContent;
      exit 1;
    fi;
    echo "Services started successfully on the new instance with id ${}!"

    EOF
  }
}

From now on, Terraform will wait for cloud-init to complete before marking the instance ready.
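One caveat worth checking: as far as the AWS CLI documentation describes it, aws ssm wait command-executed polls every 5 seconds and gives up after 20 checks, so a cloud-init run that takes longer than about 100 seconds would be reported as a failure. A possible workaround, sketched here with a hypothetical wait_for_command function and arbitrary default timings, is to poll the invocation status ourselves:

```shell
# Hypothetical helper: poll an SSM command until it reaches a terminal state.
# Usage: wait_for_command COMMAND_ID INSTANCE_ID [TIMEOUT_SECONDS] [DELAY_SECONDS]
wait_for_command() {
  local command_id="$1" instance_id="$2"
  local timeout="${3:-900}" delay="${4:-10}"
  local elapsed=0
  local status

  while [ "$elapsed" -le "$timeout" ]; do
    # Right after send-command the invocation may not exist yet, so a
    # failed lookup is treated the same as "Pending".
    status=$(aws ssm get-command-invocation \
      --command-id "$command_id" \
      --instance-id "$instance_id" \
      --query Status --output text 2>/dev/null || echo Pending)

    case "$status" in
      Success)
        return 0 ;;
      Cancelled|TimedOut|Failed)
        echo "Command ${command_id} ended with status ${status}" >&2
        return 1 ;;
    esac

    sleep "$delay"
    elapsed=$((elapsed + delay))
  done

  echo "Gave up waiting for command ${command_id} after ${timeout}s" >&2
  return 1
}
```

Inside the provisioner, this function could take the place of the aws ssm wait call, with the timeout sized to the slowest cloud-init run we expect.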


AWS Session Manager, AWS Run Command, and the other tools in the AWS Systems Manager family are quite powerful, and in my experience they are not widely used. I find them extremely useful: for example, they also allow connecting via SSH to instances without having any port open, including port 22! Basically, they allow managing and running commands inside instances only through AWS APIs, with a lot of benefits, as AWS explains:

Session Manager provides secure and auditable instance management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys. Session Manager also allows you to comply with corporate policies that require controlled access to instances, strict security practices, and fully auditable logs with instance access details, while still providing end users with simple one-click cross-platform access to your managed instances.

Do you have any questions, feedback, criticism, or requests for support? Leave a comment below, or drop me an email at