A practitioner's blueprint covering assessment automation, Landing Zone IaC, CI/CD pipeline design, governance guardrails, cutover orchestration, and full operationalisation — with real scripts and Terraform patterns you can use today.
Migrating hundreds of enterprise workloads to Azure is not a lift-and-shift exercise — it is a disciplined engineering programme. This post distils a proven six-phase delivery framework, complete with automation scripts, reusable Terraform patterns, CI/CD pipeline templates, and governance guardrails, so your team can replicate it at scale.
Each phase is time-boxed, produces version-controlled artefacts in Azure Repos, and gates the next phase on signed acceptance criteria. The full programme spans 24–36 weeks for a 100–500 workload migration.
The first two sprints are entirely about data. You cannot design a migration without an accurate, scored, dependency-aware inventory. Azure Migrate provides the platform; PowerShell and Python provide the automation layer on top of it.
Instead of clicking through the portal, deploy the Azure Migrate project and register all required resource providers in a single repeatable PowerShell script. This ensures every programme environment — DEV, QA, PROD — is configured identically.
```powershell
# Create Azure Migrate project + register required resource providers
param(
    [Parameter(Mandatory)] [string]$ResourceGroupName,
    [Parameter(Mandatory)] [string]$ProjectName,
    [Parameter(Mandatory)] [string]$Location,
    [Parameter(Mandatory)] [string]$SubscriptionId
)

Connect-AzAccount -TenantId $env:ARM_TENANT_ID
Set-AzContext -SubscriptionId $SubscriptionId
New-AzResourceGroup -Name $ResourceGroupName -Location $Location -Force

# Register all providers needed for Azure Migrate
$providers = @('Microsoft.Migrate', 'Microsoft.OffAzure',
               'Microsoft.DataMigration', 'Microsoft.HybridCompute')
$providers | ForEach-Object { Register-AzResourceProvider -ProviderNamespace $_ | Out-Null }

# Create the Azure Migrate project
$resourceId = "/subscriptions/$SubscriptionId/resourceGroups/" +
              "$ResourceGroupName/providers/Microsoft.Migrate/" +
              "MigrateProjects/$ProjectName"
New-AzResource -ResourceId $resourceId `
    -ApiVersion "2023-06-06" -Properties @{} `
    -Location $Location -Force

Write-Host "Project ready. Deploy OVA and register appliance via portal." -ForegroundColor Green
```
After 30 days of observation, the Python export script calls the Azure Migrate REST API, scores each server by CPU cores, memory, and disk count, and auto-assigns Wave 1 / 2 / 3 — the bottom 40% (simplest) go first, the most complex 20% go last.
```python
def score_workload(server: dict) -> dict:
    props = server.get('properties', {})
    cores = props.get('numberOfProcessorCore', 0)
    mem = props.get('allocatedMemoryInMB', 0) / 1024
    disks = len(props.get('disks', {}))
    # Weighted complexity — higher = migrate later
    score = (cores * 2) + (mem * 0.5) + (disks * 3)
    tier = ('T1-Critical' if score > 40 else
            'T2-Standard' if score > 20 else
            'T3-Simple')
    return {'ComplexityScore': round(score, 1),
            'SuggestedTier': tier}  # plus further fields, elided here

# Auto wave assignment: bottom 40% → Wave 1, mid 40% → Wave 2, top 20% → Wave 3
# 'scored' is the list of score_workload() results, sorted ascending by ComplexityScore
total = len(scored)
for i, s in enumerate(scored):
    s['Wave'] = ('Wave-1' if i < total * 0.4 else
                 'Wave-2' if i < total * 0.8 else
                 'Wave-3')
```
Never enter Phase 2 without a signed-off wave plan. Every Landing Zone design decision — address space, firewall rules, subnet sizing — is shaped by the workload inventory produced in Phase 1.
The Landing Zone is the foundation every migrated workload lands on. It must be deployed before a single VM is replicated, and it must be entirely defined in Terraform — no manual portal steps, no snowflake configuration.
The hub VNet hosts Azure Firewall Premium, ExpressRoute/VPN Gateway, Bastion, and the Private DNS Resolver. Spoke VNets are created per workload subscription using for_each, each peered to the hub with gateway transit enabled. All traffic between spokes flows through the hub firewall — no direct spoke-to-spoke paths.
```hcl
# Spoke VNets — one per workload subscription, created with for_each
resource "azurerm_virtual_network" "spoke" {
  for_each            = var.spokes
  name                = "vnet-${each.key}-${var.environment}-${var.location_short}"
  location            = var.location
  resource_group_name = each.value.resource_group_name
  address_space       = [each.value.address_space]
  dns_servers         = var.custom_dns_servers
  tags                = local.common_tags
}

# Hub → Spoke peering with gateway transit
resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  for_each                  = var.spokes
  name                      = "peer-hub-to-${each.key}"
  resource_group_name       = var.hub_resource_group_name
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke[each.key].id
  allow_forwarded_traffic   = true
  allow_gateway_transit     = true
}

# Spoke → Hub peering with remote gateway use
resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  for_each                  = var.spokes
  name                      = "peer-${each.key}-to-hub"
  resource_group_name       = each.value.resource_group_name
  virtual_network_name      = azurerm_virtual_network.spoke[each.key].name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  use_remote_gateways       = true
}
```
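Because every spoke's `address_space` feeds the peering mesh, a quick pre-flight check that no two ranges overlap saves a failed `terraform apply`. A minimal sketch using Python's standard `ipaddress` module — the `spokes` map here is a hypothetical stand-in for `var.spokes`:

```python
from ipaddress import ip_network
from itertools import combinations

# Hypothetical stand-in for the var.spokes map plus the hub range
spokes = {
    'hub':    '10.0.0.0/22',
    'app01':  '10.0.4.0/24',
    'data01': '10.0.5.0/24',
}

def find_overlaps(cidrs):
    """Return every pair of names whose CIDR ranges overlap."""
    nets = {name: ip_network(cidr) for name, cidr in cidrs.items()}
    return [(a, b) for (a, na), (b, nb) in combinations(nets.items(), 2)
            if na.overlaps(nb)]

overlaps = find_overlaps(spokes)  # empty list when the address plan is clean
```

Running this against the wave plan's address allocation before each `terraform plan` turns an IP-planning mistake into a one-line CI failure instead of a half-peered topology.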
Before any terraform init, the PowerShell backend script creates a GRS storage account with soft-delete, blob versioning, and a CanNotDelete resource lock — preventing accidental state loss for the duration of the programme. The backend configuration block is printed to the terminal ready to paste into backend.tf.
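The printed backend block is plain text, so generating it is easy to script in any language. A hedged sketch of rendering the `backend.tf` snippet the way the PowerShell script might — the resource names below are placeholders, not the programme's actual values:

```python
def render_backend_block(rg: str, account: str, container: str, key: str) -> str:
    """Render an azurerm backend configuration block for backend.tf."""
    return (
        'terraform {\n'
        '  backend "azurerm" {\n'
        f'    resource_group_name  = "{rg}"\n'
        f'    storage_account_name = "{account}"\n'
        f'    container_name       = "{container}"\n'
        f'    key                  = "{key}"\n'
        '  }\n'
        '}\n'
    )

# Placeholder names for illustration only
block = render_backend_block('rg-tfstate', 'sttfstateprod',
                             'tfstate', 'landing-zone.tfstate')
```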
Every Terraform repository in the programme — Landing Zone, pattern library, governance — consumes the same reusable Azure DevOps YAML template stored in pipelines-templates/terraform-cicd.yml. This single template enforces six mandatory stages on every PR merge:
The first stage runs `terraform fmt -check` and `tflint` — zero errors required to pass. The final stage runs `terraform apply -auto-approve` on PROD, but only after a manual approval gate. The approval gate exists for a reason: even if the plan looks identical to DEV, a reviewer's eyes catch destructive changes that automation misses — particularly when resource dependencies shift between migration waves.
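One way to back the human approval gate with automation is to scan the machine-readable plan (`terraform show -json plan.out`) for delete actions and flag them on the PR. A minimal sketch — the `plan` dict below is a hand-built fragment mimicking the JSON output format, not real pipeline data:

```python
def destructive_changes(plan: dict) -> list:
    """Return addresses of resources the plan would delete or replace."""
    return [rc['address']
            for rc in plan.get('resource_changes', [])
            if 'delete' in rc['change']['actions']]

# Hand-built fragment mimicking `terraform show -json plan.out` output
plan = {
    'resource_changes': [
        {'address': 'azurerm_virtual_network.spoke["app01"]',
         'change': {'actions': ['update']}},                    # safe
        {'address': 'azurerm_network_interface.legacy',
         'change': {'actions': ['delete', 'create']}},          # replacement
    ]
}

flagged = destructive_changes(plan)  # flags the replaced NIC for review
```

Posting `flagged` as a PR comment gives the approver a short list of exactly what would be destroyed, instead of a thousand-line plan diff.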
- **Terraform** — primary IaC engine. Version pinned in `versions.tf`; `tfenv` manages version switching across environments.
- **Static security scanning** — security misconfiguration and CIS benchmark compliance. Hard-fail gates on HIGH/CRITICAL findings.
- **Azure Pipelines** — multi-stage YAML pipelines. Single reusable template consumed by all infra repos via resource reference.
- **Branch policies** — no direct commits to `main`. Every infra change is a reviewed PR with Terraform plan output posted as a comment.
- **Terratest** — Go-based integration testing for Terraform modules. Runs post-apply in isolated test subscriptions.
- **terraform-docs** — auto-generates module READMEs from variable and output declarations. Runs as a CI step on every merge to `main`.
Every workload from the wave plan is assigned one of six migration patterns before Sprint 9. The pattern determines which Terraform module, runbook, and cutover script applies — there are no ad-hoc decisions on cutover night.
| Pattern | Use Case | IaC Approach | Speed |
|---|---|---|---|
| Rehost (Lift & Shift) | Migrate VM as-is via ASR replication | ASR Terraform module + DNS cutover script | Fastest |
| Replatform | Minor OS or DB optimisation during migration | Terraform + Ansible post-config playbook | Fast |
| Refactor | Containerise to AKS or move to PaaS | Terraform PaaS modules + Helm charts | Medium |
| Re-purchase | Replace on-premises app with SaaS (M365, Dynamics) | Azure Marketplace + ARM linked templates | Medium |
| Retire | Decommission — no migration path exists | terraform destroy pipeline + data archive | N/A |
| Retain | Regulatory or latency constraint prevents migration | ExpressRoute + Azure Arc hybrid config | On-prem |
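Because the pattern fixes the module, runbook, and cutover script, the mapping can live in code rather than in someone's head on cutover night. A minimal sketch of such a lookup — the module and runbook names are illustrative placeholders, not the programme's actual artefact names:

```python
# Illustrative pattern → artefact mapping; names are placeholders
PATTERN_ARTIFACTS = {
    'Rehost':     {'module': 'asr-replication',       'runbook': 'rb-rehost-cutover'},
    'Replatform': {'module': 'vm-ansible-postconfig', 'runbook': 'rb-replatform'},
    'Refactor':   {'module': 'aks-paas',              'runbook': 'rb-refactor'},
    'Repurchase': {'module': 'marketplace-saas',      'runbook': 'rb-repurchase'},
    'Retire':     {'module': 'destroy-archive',       'runbook': 'rb-retire'},
    'Retain':     {'module': 'arc-hybrid',            'runbook': 'rb-retain'},
}

def artifacts_for(workload: dict) -> dict:
    """Resolve the artefacts for a workload; unassigned patterns fail loudly."""
    pattern = workload.get('pattern')
    if pattern not in PATTERN_ARTIFACTS:
        raise ValueError(f'No migration pattern assigned: {pattern!r}')
    return PATTERN_ARTIFACTS[pattern]
```

The hard failure on an unknown pattern is the point: a workload that reaches the cutover pipeline without a pattern assignment stops the run instead of triggering an ad-hoc decision.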
The cutover PowerShell script follows a strict sequence for every wave. Pre-flight checks validate replication health and lag (must be under 5 minutes). The application is quiesced. ASR planned failover is triggered per VM with a wait loop. Private DNS records are then cut over to target VM private IPs. Finally, the Python smoke test suite runs — if any test fails three times, the rollback script fires ASR failback automatically.
```powershell
# Pre-flight: validate replication health and lag for each VM
foreach ($vmName in $VMNames) {
    $item = Get-AzRecoveryServicesAsrReplicationProtectedItem |
        Where-Object { $_.FriendlyName -eq $vmName }
    if ($item.ReplicationHealth -ne "Normal") {
        throw "$vmName replication health: $($item.ReplicationHealth)"
    }
    if ($item.RecoveryPointObjective -gt 300) {
        Write-Warning "$vmName lag $($item.RecoveryPointObjective)s exceeds 5min"
    }
}

# Trigger planned failover and wait for completion
foreach ($vmName in $VMNames) {
    # Re-fetch the protected item so each VM fails over its own replica
    $item = Get-AzRecoveryServicesAsrReplicationProtectedItem |
        Where-Object { $_.FriendlyName -eq $vmName }
    Start-AzRecoveryServicesAsrPlannedFailoverJob `
        -ReplicationProtectedItem $item -Direction PrimaryToRecovery | Out-Null
}

# DNS cutover — update Private DNS A record to new target IP
foreach ($vmName in $VMNames) {
    $vm = Get-AzVM -Name $vmName   # target VM created by the failover
    $privateIp = (Get-AzNetworkInterface -ResourceId $vm.NetworkProfile.NetworkInterfaces[0].Id).IpConfigurations[0].PrivateIpAddress
    New-AzPrivateDnsRecordSet -ZoneName $DnsZoneName -Name $vmName `
        -RecordType A -Ttl 300 `
        -PrivateDnsRecords (New-AzPrivateDnsRecordConfig -IPv4Address $privateIp)
}

# Run smoke test suite — rollback triggered automatically on failure
python smoke_tests/run_wave_smoke_tests.py --vms ($VMNames -join ',')
```
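The "fail three times, then roll back" rule inside the smoke-test step reduces to a small retry wrapper. A sketch of that logic in isolation — the test and rollback callables are injected fakes here, and nothing calls real Azure APIs:

```python
def run_with_rollback(tests, rollback, max_attempts=3):
    """Run each smoke test up to max_attempts times; trigger rollback
    (e.g. ASR failback) if any test exhausts its attempts."""
    for name, test in tests.items():
        for attempt in range(1, max_attempts + 1):
            if test():
                break                       # test passed — move to the next one
            if attempt == max_attempts:
                rollback(failed_test=name)  # exhausted retries — fail the wave
                return False
    return True

# Usage sketch with injected fakes
rolled_back = []
flaky = iter([False, False, True])          # passes on the third attempt
ok = run_with_rollback({'http-200': lambda: next(flaky)},
                       rollback=lambda failed_test: rolled_back.append(failed_test))
# ok is True and no rollback fired: the flaky test passed within 3 attempts
```

The retry budget absorbs transient post-failover noise (DNS TTLs, services still starting) while a genuinely broken workload still triggers an automatic failback.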
Use ASR's test failover into an isolated network. Validate the application fully, then clean up test VMs. Any DNS or connectivity issues found here are far cheaper to fix than on cutover night.
Governance is deployed on Day 1 of Sprint 3 — before any workload is migrated. Azure Policy is the enforcement layer; Terraform is the delivery mechanism. Every migration wave subscription receives the same mandatory policy initiative at assignment time.
```hcl
resource "azurerm_policy_set_definition" "migration_governance" {
  name                = "migration-governance-initiative"
  policy_type         = "Custom"
  display_name        = "Azure Migration Governance Initiative"
  management_group_id = var.management_group_id

  # Allowed locations — Deny effect
  policy_definition_reference {
    policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/e56962a6-..."
    parameter_values = jsonencode({
      listOfAllowedLocations = { value = "[parameters('allowedLocations')]" }
    })
  }

  # Diagnostic settings — DeployIfNotExists
  policy_definition_reference {
    policy_definition_id = azurerm_policy_definition.diag_settings.id
    parameter_values = jsonencode({
      workspaceId = { value = "[parameters('logAnalyticsWorkspaceId')]" }
    })
  }
}

# Assign the initiative at Management Group scope
resource "azurerm_management_group_policy_assignment" "migration_governance" {
  name                 = "migration-governance"
  management_group_id  = var.management_group_id
  policy_definition_id = azurerm_policy_set_definition.migration_governance.id
  location             = var.location

  identity {
    type = "SystemAssigned"
  }
}
```
Budget alerts at 80% (warning) and 100% (critical) are deployed via Terraform per subscription. A weekly PowerShell tag compliance report scores every resource group against the mandatory tag taxonomy and emails the result to the cloud governance team — non-compliant resources cannot slip through unnoticed.
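The weekly tag report reduces to a scoring function over resource-group tags. A minimal sketch of that scoring — the mandatory taxonomy below is illustrative, not the programme's actual one:

```python
# Illustrative mandatory tag taxonomy
MANDATORY_TAGS = {'costCentre', 'owner', 'environment', 'dataClassification'}

def tag_compliance(resource_groups: dict) -> dict:
    """Score each resource group: fraction of mandatory tags present."""
    report = {}
    for name, tags in resource_groups.items():
        present = MANDATORY_TAGS & set(tags or {})
        report[name] = {
            'score': len(present) / len(MANDATORY_TAGS),
            'missing': sorted(MANDATORY_TAGS - present),
        }
    return report

report = tag_compliance({
    'rg-app01':  {'costCentre': 'CC01', 'owner': 'app-team',
                  'environment': 'prod', 'dataClassification': 'internal'},
    'rg-legacy': {'owner': 'unknown'},
})
# rg-app01 scores 1.0; rg-legacy scores 0.25 with three tags missing
```

Emailing the `missing` lists rather than a bare percentage gives owners an actionable fix list, which is what keeps the score trending upward week over week.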
A migration that hands over unmonitored, undocumented workloads is not a successful migration. Phase 6 establishes the SRE framework, DR automation pipeline, and knowledge transfer programme that make the client team genuinely self-sufficient.
Azure Monitor scheduled query rules enforce SLOs via KQL. The fast-burn alert fires when the error budget is burning 14× faster than the 30-day allowance — the "page someone immediately" signal, deployed as Terraform for every Tier 1 workload:
```kusto
// Fast-burn: error budget consuming 14× faster than 30d SLO allows
let errorBudgetPct = 1 - 0.9995;   // Tier 1 SLO: 99.95% availability
let burnMultiple = 14.0;           // 14× = page immediately
AppRequests
| where AppRoleName == 'your-workload-name'
| summarize total = count(), failed = countif(Success == false)
    by bin(TimeGenerated, 5m)
| extend errorRate = todouble(failed) / todouble(total)
| where errorRate > errorBudgetPct * burnMultiple
```
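The numbers in the query are worth sanity-checking: a 99.95% SLO leaves a 0.05% error budget, so the 14× fast-burn condition fires once the observed error rate exceeds 0.7%. The arithmetic as a check:

```python
slo = 0.9995                    # Tier 1: 99.95% availability
error_budget_pct = 1 - slo      # 0.0005 → 0.05% error budget
burn_multiple = 14.0            # fast-burn: page immediately

# Alert threshold on the 5-minute error rate
threshold = error_budget_pct * burn_multiple   # ≈ 0.007, i.e. 0.7%
```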
A scheduled Azure DevOps pipeline runs on the 1st of every third month at 02:00 UTC. It triggers an ASR test failover into an isolated network, runs the full smoke test suite against the DR VMs, and then cleans up the test environment automatically — all without human intervention. Results publish to the Azure DevOps test reporting dashboard.
```yaml
# Automated DR test — runs quarterly, no human trigger required
schedules:
  - cron: '0 2 1 */3 *'   # 02:00 UTC on the 1st of every 3rd month
    displayName: Quarterly DR Test
    branches: { include: [main] }
    always: true

stages:
  - stage: TestFailover
    jobs:
      - deployment: dr_test
        environment: dr-test   # isolated environment in Azure DevOps
        strategy:
          runOnce:
            deploy:
              steps:
                - script: |
                    python dr_tests/run_dr_test.py \
                      --workload '$(WORKLOAD_NAME)' \
                      --vault '$(VAULT_NAME)' \
                      --network '$(ISOLATED_VNET_ID)'

  - stage: Cleanup
    condition: always()   # clean up even if tests fail
    jobs:
      - job: cleanup_dr_vms
        steps:
          - pwsh: |
              # Removes test failover VMs from the isolated network
              Get-AzRecoveryServicesAsrReplicationProtectedItem |
                Where-Object { $_.FailoverRecoveryPointId } |
                ForEach-Object {
                  Start-AzRecoveryServicesAsrTestFailoverCleanupJob -ReplicationProtectedItem $_
                }
```
Documentation and workshops are necessary but not sufficient. The programme does not close until a client engineer can independently deploy a new module from the pattern library, run a pipeline to PROD, and respond to a Defender for Cloud alert — all without assistance from the migration team. This is witnessed, signed off, and recorded before the contract ends.
Terraform & IaC workshops · Azure DevOps pipeline patterns · Azure Policy & Governance lab · Per-workload operational runbooks · SLO/SLI monitoring walkthrough · DR tabletop exercise + live rehearsal · FinOps dashboard handover · Client independence witnessed exercise.
Most migration frameworks focus on the destination — what Azure services to use. This one focuses on the delivery system — how to produce consistent, auditable, repeatable outcomes across every wave, regardless of which engineer is running it.
The reusable Terraform module library, the shared CI/CD pipeline template, the policy initiative deployed on Day 1, and the scored wave inventory all serve one goal: make the second wave as reliable as the first, and the tenth wave as reliable as the second.
The scripts and Terraform patterns in this programme are not one-offs. Every module is versioned, README-generated, and consumed by reference — not copy-pasted. Every pipeline run produces an audit trail from commit to deployed infrastructure. Every policy violation is caught before it reaches PROD. That is what enterprise-scale migration engineering looks like.
The complete reference document covers all six phases, 22+ scripts and Terraform modules, RACI matrices, full CI/CD YAML templates, the 13-sprint delivery plan, and handover checklists.