Tutorials teach you how to write Terraform, but don’t teach you what happens when 60 engineers start writing it together.
When you learn Terraform, you work with a single repository, a single state file, and a single environment. You run terraform apply from your laptop, and your infrastructure is provisioned.
That model works fine until the day you join a company and realize engineers rarely apply to production from a laptop. A lot of what you see will not match what you practiced.
This article explains how large engineering teams actually run Terraform — the repositories, workflows, ownership rules, and what goes wrong without them.
You’ll learn how enterprise teams structure repositories and state files, how they store and version reusable modules through GitHub, why infrastructure changes move to production through pipelines, how they catch changes that happen outside of Terraform, and how they recover when things go wrong.
Every practice here exists because a team hit a specific wall and built something to get past it.
Prerequisites
You should be comfortable with Terraform before reading this. You should also know how Git pull requests and branch merging work.
This is not a Terraform introduction — it is about what happens after you have learned the basics and start sharing infrastructure with other engineers.
Table of Contents
- How State Corruption Happens
- Why State File Gets Treated Like a Production Database
- How Enterprise Teams Structure Their Terraform Repositories
- How Teams Split State Files to Protect Each Other
- Why Some Teams Prefer Directories Over Workspaces for Production
- How Teams Share Infrastructure Through Modules on GitHub
- How Teams Version and Release Terraform Modules
- How Teams Maintain Terraform Modules at Scale
- How Teams Share Data Between State Files
- How Infrastructure Changes Actually Move to Production
- How Teams Detect Infrastructure Drift
- How Teams Recover When State Goes Wrong
- Conclusion
How State Corruption Happens
The state file is how Terraform tracks what it has built. It remembers every resource, every ID, and every configuration value. When it gets out of sync with what actually exists in the cloud, that’s state corruption.
State corruption gets blamed for a lot of things. But engineers who have dealt with it in production know it usually traces back to one of a handful of situations, each with a different cause and a different fix.
Two Engineers Run terraform apply at the Same Time
Before understanding this one, you need to understand something about how Terraform works.
When you run terraform apply, two things happen separately:

First, Terraform talks to AWS, and the resource gets created in the cloud. Second, Terraform updates the state file to record what was just built.
These are two different systems. AWS holds the real infrastructure, and the state file is Terraform’s notebook about it. If anything interrupts the process between step one and step two, they fall out of sync.
Now here’s what happens when two engineers apply at the same time without locking:

Sarah opens the state file and starts adding a subnet. Marcus opens the same state file at the same moment and starts updating a NAT gateway. Both are working from the same starting copy.
Sarah finishes first. Her apply creates the subnet in AWS and updates the state file to record it.
Marcus finishes second. His apply updates the NAT gateway in AWS. Terraform then updates the state file using the version of state Marcus read when his apply started.
That version didn’t include Sarah’s subnet, so the updated state no longer contains a record of it.

The subnet exists in AWS. But Terraform’s notebook no longer has a record of it. The next terraform plan thinks the subnet was never created and proposes building it again.
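The race above can be reproduced with a toy model. This is only a sketch: the dictionaries cloud, remote_state, and desired, and the helper run_apply, are hypothetical stand-ins for AWS, the state file, and the configuration, not anything Terraform exposes.

```python
import copy

# Toy model: "cloud" is what really exists in AWS,
# "remote_state" is Terraform's record of it.
cloud = {"nat_gateway": "v1"}
remote_state = {"nat_gateway": "v1"}
desired = {"nat_gateway": "v2", "subnet": "created"}


def run_apply(snapshot, change):
    """Apply a change to the cloud, then return the state this run writes back."""
    cloud.update(change)                 # step one: the real resource changes
    new_state = copy.deepcopy(snapshot)  # step two starts from this run's own snapshot
    new_state.update(change)
    return new_state


# Both engineers read the same starting copy of state (no locking).
sarah_snapshot = copy.deepcopy(remote_state)
marcus_snapshot = copy.deepcopy(remote_state)

# Sarah finishes first and writes state that includes her subnet.
remote_state = run_apply(sarah_snapshot, {"subnet": "created"})

# Marcus finishes second and overwrites state from his stale snapshot.
remote_state = run_apply(marcus_snapshot, {"nat_gateway": "v2"})

# A simplified "plan": anything desired that the state does not record.
plan = sorted(name for name in desired if name not in remote_state)

print(sorted(cloud))         # ['nat_gateway', 'subnet']
print(sorted(remote_state))  # ['nat_gateway']
print(plan)                  # ['subnet']
```

The subnet is present in cloud but missing from remote_state, so the simplified plan proposes creating it again.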
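State locking prevents this. Sarah’s apply acquires a lock before it starts. When Marcus tries to apply, Terraform sees the lock and will not let his run touch the state. By default his apply fails with a lock error and he has to retry later. If he sets -lock-timeout, Terraform waits and keeps retrying until the lock is free.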
After Sarah finishes, Terraform updates the state file and releases the lock. Marcus then runs against the updated state, so both the subnet and NAT gateway changes are recorded correctly.
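The same toy scenario with a lock shows why both changes survive. This sketch uses a Python threading.Lock as a stand-in for the backend's state lock, and the function name locked_apply is hypothetical. The second thread blocks until the lock is free, which corresponds to Terraform waiting for the lock rather than failing straight away.

```python
import threading
import time

state = {"nat_gateway": "v1"}
state_lock = threading.Lock()  # stands in for the backend's state lock


def locked_apply(change):
    """Read, modify, and write state while holding the lock."""
    with state_lock:
        snapshot = dict(state)   # read the current state
        time.sleep(0.01)         # pretend the cloud API call takes a while
        snapshot.update(change)
        state.clear()
        state.update(snapshot)   # write back before releasing the lock


sarah = threading.Thread(target=locked_apply, args=({"subnet": "created"},))
marcus = threading.Thread(target=locked_apply, args=({"nat_gateway": "v2"},))
sarah.start()
marcus.start()
sarah.join()
marcus.join()

print(sorted(state.items()))  # [('nat_gateway', 'v2'), ('subnet', 'created')]
```

Whichever thread gets the lock first, the other one reads state only after the first has written it, so neither change is lost.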
An Apply Gets Interrupted
A GitHub Actions pipeline is applying changes to the payments infrastructure, adding three new security group rules and a database parameter group. Halfway through, the pipeline runner hits its 60-minute timeout limit, and the job gets killed.
Here’s what the apply actually managed to do before dying:

Three security group rules complete successfully before the pipeline hits its 60-minute runtime limit. The runner is then terminated. The database parameter group never finishes creating, and the state file update never runs because the job died first.
Security group rule 1 → created ✓
Security group rule 2 → created ✓
Security group rule 3 → created ✓
Database parameter → not created ✗
State file update → never ran (job died first)
The three security group rules now exist in AWS. The problem is that the pipeline died before Terraform could finish updating the state file. AWS knows the rules exist. Terraform’s state file does not.
At this point, reality and the state file no longer match.
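Fortunately, this is usually easy to recover from. Terraform normally updates its state as each resource completes rather than only at the very end, so an interrupted run tends to leave most finished work tracked. When the pipeline runs again, Terraform refreshes the resources it has on record, sees that those security group rules already exist, and doesn’t try to create them again. It then creates the database parameter group that never got built. A rule that exists in AWS but never reached the state file is the exception: Terraform does not discover untracked resources on its own, so it has to be brought back under management with terraform import.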
The second run completes successfully and the state file catches up.
This works because Terraform is idempotent — running the same configuration again moves infrastructure toward the desired state rather than blindly creating everything from scratch.
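A small simulation makes the rerun behavior concrete. It is a sketch under one loud assumption: each resource is recorded in state as soon as it is created, so the three finished rules are already tracked when the job dies. The resource names and the apply function are hypothetical.

```python
desired = ["sg_rule_1", "sg_rule_2", "sg_rule_3", "db_parameter_group"]
state = set()  # resource names Terraform has recorded


def apply(desired, state, die_after=None):
    """Create anything in desired that state lacks; optionally die partway through."""
    created = []
    for name in desired:
        if name in state:
            continue                      # already tracked, nothing to do
        if die_after is not None and len(created) == die_after:
            raise TimeoutError("runner killed")
        state.add(name)                   # create the resource and record it
        created.append(name)
    return created


try:
    apply(desired, state, die_after=3)    # first run dies after three resources
except TimeoutError:
    pass

second_run = apply(desired, state)        # rerun only builds what is missing
third_run = apply(desired, state)         # nothing left to change

print(second_run)  # ['db_parameter_group']
print(third_run)   # []
```

The second run creates only the missing parameter group, and a third run has nothing to do. That is idempotence in practice: the outcome depends on the gap between desired and recorded state, not on how many times apply runs.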
One small complication remains: the state lock.
If the pipeline was interrupted while holding a lock, Terraform may still think another apply is running. The next pipeline run then fails immediately with an “Error acquiring the state lock” message that shows the lock ID and who holds it.

Before clearing the lock, make sure no Terraform apply is still running.
Open your CI/CD system (GitHub Actions, GitLab CI, Jenkins, or whatever your team uses) and check the pipeline history for that environment to confirm nothing is still in progress.

Once you have confirmed no apply is actively running, you can safely release the stale lock with terraform force-unlock, passing the lock ID from the error message, and then retry the pipeline. On the next run, Terraform picks up from the state it has recorded and creates whatever is still missing, bringing the infrastructure back in line with the configuration.