Tutorials teach you how to write Terraform, but don’t teach you what happens when 60 engineers start writing it together.

When you learn Terraform, you work with a single repository, a single state file, and a single environment. You run terraform apply from your laptop, and your infrastructure is provisioned.

That model works fine until the day you join a company and realize engineers rarely apply to production from a laptop. A lot of what you see will not match what you practiced.

This article explains how large engineering teams actually run Terraform — the repositories, workflows, ownership rules, and what goes wrong without them.

You’ll learn how enterprise teams structure repositories and state files, how they store and version reusable modules through GitHub, why infrastructure changes move to production through pipelines, how they catch changes that happen outside of Terraform, and how they recover when things go wrong.

Every practice here exists because a team hit a specific wall and built something to get past it.

Prerequisites

You should be comfortable with Terraform before reading this. You should also know how Git pull requests and branch merging work.

This is not a Terraform introduction — it is about what happens after you have learned the basics and start sharing infrastructure with other engineers.

How State Corruption Happens

The state file is how Terraform tracks what it has built. It remembers every resource, every ID, and every configuration value. When it gets out of sync with what actually exists in the cloud, that’s state corruption.

State corruption gets blamed for a lot of things. But engineers who have dealt with it in production know that it usually traces back to one of a handful of situations, each with a different cause and a different fix.

Two Engineers Run terraform apply at the Same Time

Before understanding this one, you need to understand something about how Terraform works.

When you run terraform apply, two things happen separately:

Diagram showing the two separate steps of terraform apply. Step 1: Terraform tells AWS to create the subnet, and AWS creates it in the cloud. Step 2: Terraform updates the state file to record that the subnet now exists.

First, Terraform talks to AWS, and the resource gets created in the cloud. Second, Terraform updates the state file to record what was just built.

These are two different systems. AWS holds the real infrastructure, and the state file is Terraform’s notebook about it. If anything interrupts the process between step one and step two, they fall out of sync.

Now here’s what happens when two engineers apply at the same time without locking:

Diagram showing Sarah and Marcus both open the same Terraform state file at the same time. Sarah reads the state, adds a subnet, and saves. Marcus reads the same original state, updates the NAT gateway, and saves last. His save overwrites Sarah's. The final state file contains the NAT gateway update but the subnet record is gone, even though the subnet still exists in AWS.

Sarah opens the state file and starts adding a subnet. Marcus opens the same state file at the same moment and starts updating a NAT gateway. Both are working from the same starting copy.

Sarah finishes first. Her apply creates the subnet in AWS and updates the state file to record it.

Marcus finishes second. His apply updates the NAT gateway in AWS. Terraform then updates the state file using the version of state Marcus read when his apply started.

That version didn’t include Sarah’s subnet, so the updated state no longer contains a record of it.

Comparison showing AWS contains both the subnet and NAT gateway update, while Terraform's state file is missing the subnet record.

The subnet exists in AWS. But Terraform’s notebook no longer has a record of it. The next terraform plan thinks the subnet was never created and proposes building it again.

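State locking prevents this. Sarah’s apply acquires a lock on the state before it starts. When Marcus tries to apply, Terraform refuses to proceed while the lock is held: by default his run fails right away with a lock error, and with the -lock-timeout option it keeps retrying for the duration you specify.

Once Sarah’s apply finishes, Terraform writes the updated state and releases the lock. Marcus’s apply then runs against the updated state, so both the subnet and NAT gateway changes are recorded correctly.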
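Locking is a property of the backend that stores the state, so it is switched on in the backend configuration. The block below is a minimal sketch that assumes the team keeps state in an S3 bucket; the article does not say which backend is in use, and the bucket name, key, and region are placeholders. It also assumes Terraform 1.10 or later, where the S3 backend accepts use_lockfile. Older setups typically point dynamodb_table at a DynamoDB table instead.

```hcl
# Sketch only: bucket, key, and region are placeholder values.
terraform {
  # use_lockfile is available in the S3 backend from Terraform 1.10 onward.
  required_version = ">= 1.10.0"

  backend "s3" {
    bucket = "example-terraform-state"   # hypothetical bucket name
    key    = "network/terraform.tfstate" # hypothetical path to the state object
    region = "us-east-1"                 # assumed region

    # Write a lock file next to the state object for the duration of each
    # operation. A second run against the same state cannot acquire the lock
    # until the first run releases it.
    use_lockfile = true

    # Encrypt the state object at rest.
    encrypt = true
  }
}
```

With a locking backend in place, a second engineer can run terraform apply -lock-timeout=5m to wait up to five minutes for the lock rather than failing immediately.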

An Apply Gets Interrupted

A GitHub Actions pipeline is applying changes to the payments infrastructure, adding three new security group rules and a database parameter group. Halfway through, the pipeline runner hits its 60-minute timeout limit, and the job gets killed.

Here’s what the apply actually managed to do before dying:

A terminal showing terraform apply running. Three security group rules are created successfully at 12:00. At 12:00:07, the database parameter group starts creating. At 12:01:30, two errors appear in red: Job exceeded maximum runtime 60m and Runner terminated. A pipeline summary below shows security group rules 1, 2, and 3 as created with green checkmarks, database parameter as not created with a red X, and state file update as never wrote because the job died first, also with a red X.

Three security group rules complete successfully before the pipeline hits its 60-minute runtime limit. The runner is then terminated. The database parameter group never finishes creating, and the state file update never runs because the job died first.

Security group rule 1  → created ✓
Security group rule 2  → created ✓
Security group rule 3  → created ✓
Database parameter     → not created ✗
State file update      → never wrote (job died first)

The three security group rules now exist in AWS. The problem is that the pipeline died before Terraform could finish updating the state file. AWS knows the rules exist. Terraform’s state file does not.

At this point, reality and the state file no longer match.

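Recovery depends on what the state file managed to record. Terraform only refreshes resources that are already in state; it does not scan AWS for resources it has no record of. So in this scenario, the next plan proposes creating the three security group rules again, and AWS will typically reject them as duplicates.

The fix is to bring the existing rules into state with terraform import (or an import block) and then run the pipeline again. Terraform now sees the three rules, leaves them alone, and creates the database parameter group that never got built.

The second run completes successfully and the state file catches up.

In the milder case, where Terraform saved the completed resources to state before the job died, no import is needed and a plain re-run is enough. That works because applying a Terraform configuration is idempotent: running it again moves infrastructure toward the desired state rather than blindly creating everything from scratch.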
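If a rule created by the interrupted run turns out to be missing from state, an import block is one way to adopt it rather than recreate it. The sketch below assumes Terraform 1.5 or later and assumes the rules are managed with the AWS provider's aws_vpc_security_group_ingress_rule resource. The resource name, security group ID, CIDR range, port, and rule ID are all hypothetical placeholders, not values from the pipeline described above.

```hcl
# Sketch only: every ID and value below is a placeholder.

# The resource block that the interrupted apply was trying to create.
resource "aws_vpc_security_group_ingress_rule" "payments_https" {
  security_group_id = "sg-0123456789abcdef0" # hypothetical security group
  cidr_ipv4         = "10.0.0.0/16"          # hypothetical source range
  ip_protocol       = "tcp"
  from_port         = 443
  to_port           = 443
}

# Tell Terraform that this resource already exists in AWS and should be
# recorded in state instead of being created again.
import {
  to = aws_vpc_security_group_ingress_rule.payments_https
  id = "sgr-0123456789abcdef0" # hypothetical ID of the existing rule
}
```

The next terraform plan should then report the rule as an import, not a create. Once the apply has succeeded, the import block has done its job and can be removed from the configuration.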

One small complication remains: the state lock.

If the pipeline was interrupted while holding a lock, Terraform may still think another apply is running. The next pipeline run fails immediately with an error like this:

Terminal showing terraform apply failing because the previous job left a state lock behind. The error includes the lock ID, the path to the state file, and the name of the process that acquired it. Terraform refuses to proceed until the lock is released or manually cleared.

Before clearing the lock, make sure no Terraform apply is still running.

Open your CI/CD system — GitHub Actions, GitLab CI, Jenkins, or whatever your team uses — and check the pipeline history for that environment:

The GitHub Actions pipeline history shows four recent runs: terraform-plan completed successfully, two terraform-apply runs failed, and one is currently in progress.

Once you have confirmed no apply is actively running, you can release the stale lock with terraform force-unlock, passing the lock ID from the error message, and retry the pipeline. The next run works from whatever the state file recorded: resources already in state are left alone, and resources that are still missing are created. Any resource that exists in AWS but was never written to state has to be imported first, because Terraform will not discover it on its own.