Azure Migration Engineering

Engineering a Large-Scale Azure Migration:
From Discovery to Production in 13 Sprints

A practitioner's blueprint covering assessment automation, Landing Zone IaC, CI/CD pipeline design, governance guardrails, cutover orchestration, and full operationalisation — with real scripts and Terraform patterns you can use today.

Shartul Kumar · Senior Azure Migration Architect · Azure Migrate · Terraform · DevOps · Platform Engineering

6 phases · 13 sprints · 22+ scripts & modules · 500+ workload scale
Azure Migrate · Terraform · Azure DevOps · Landing Zone · PowerShell · Governance · CAF · GitOps · SRE · FinOps · CI/CD · IaC

Migrating hundreds of enterprise workloads to Azure is not a lift-and-shift exercise — it is a disciplined engineering programme. This post distils a proven six-phase delivery framework, complete with automation scripts, reusable Terraform patterns, CI/CD pipeline templates, and governance guardrails, so your team can replicate it at scale.

The Six-Phase Programme at a Glance

Each phase is time-boxed, produces version-controlled artefacts in Azure Repos, and gates the next phase on signed acceptance criteria. The full programme spans 24–36 weeks for a 100–500 workload migration.

Phase 1 · Discover & Assess · Weeks 1–4 · Sprints S1–S2
Phase 2 · Design & Landing Zone · Weeks 5–10 · Sprints S3–S5
Phase 3 · Build & Automate · Weeks 11–16 · Sprints S6–S8
Phase 4 · Migrate & Execute · Weeks 17–24 · Sprints S9–S12
Phase 5 · Govern & Secure · Ongoing · Sprints S3–S12
Phase 6 · Operate & Optimise · Post go-live · Sprint S13+
100% policy as code · 40% reduction in manual effort · 0 HIGH/CRITICAL findings at the tfsec gate · 6 stage gates per pipeline

Phase 1 — Automating Discovery & Assessment

The first two sprints are entirely about data. You cannot design a migration without an accurate, scored, dependency-aware inventory. Azure Migrate provides the platform; PowerShell and Python provide the automation layer on top of it.

Deploy the Appliance Programmatically

Instead of clicking through the portal, deploy the Azure Migrate project and register all required resource providers in a single repeatable PowerShell script. This ensures every programme environment — DEV, QA, PROD — is configured identically.

PowerShell P1-01-deploy-migrate-appliance.ps1
# Create Azure Migrate project + register required resource providers
param(
  [Parameter(Mandatory)] [string]$ResourceGroupName,
  [Parameter(Mandatory)] [string]$ProjectName,
  [Parameter(Mandatory)] [string]$Location,
  [Parameter(Mandatory)] [string]$SubscriptionId
)

Connect-AzAccount -TenantId $env:ARM_TENANT_ID
Set-AzContext -SubscriptionId $SubscriptionId
New-AzResourceGroup -Name $ResourceGroupName -Location $Location -Force

# Register all providers needed for Azure Migrate
$providers = @('Microsoft.Migrate', 'Microsoft.OffAzure',
               'Microsoft.DataMigration', 'Microsoft.HybridCompute')
$providers | ForEach-Object {
  Register-AzResourceProvider -ProviderNamespace $_ | Out-Null
}

# Create the Azure Migrate project
$resourceId = "/subscriptions/$SubscriptionId/resourceGroups/" +
               "$ResourceGroupName/providers/Microsoft.Migrate/" +
               "MigrateProjects/$ProjectName"

New-AzResource -ResourceId $resourceId `
  -ApiVersion "2023-06-06" -Properties @{} `
  -Location $Location -Force

Write-Host "Project ready. Deploy OVA and register appliance via portal." -ForegroundColor Green

Score & Wave-Assign Inventory Automatically

After 30 days of observation, the Python export script calls the Azure Migrate REST API, scores each server by CPU cores, memory, and disk count, and auto-assigns Wave 1 / 2 / 3 — the bottom 40% (simplest) go first, the most complex 20% go last.

Python P1-02-export-inventory.py
def score_workload(server: dict) -> dict:
    props = server.get('properties', {})
    cores = props.get('numberOfProcessorCore', 0)
    mem   = props.get('allocatedMemoryInMB', 0) / 1024
    disks = len(props.get('disks', {}))

    # Weighted complexity — higher = migrate later
    score = (cores * 2) + (mem * 0.5) + (disks * 3)
    tier  = ('T1-Critical' if score > 40
             else 'T2-Standard' if score > 20
             else 'T3-Simple')
    return {'ComplexityScore': round(score, 1), 'SuggestedTier': tier, ...}

# Auto wave assignment: bottom 40% → Wave 1, mid 40% → Wave 2, top 20% → Wave 3
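# ('scored' is assumed to be the full inventory list, sorted ascending by ComplexityScore)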
total = len(scored)
for i, s in enumerate(scored):
    s['Wave'] = ('Wave-1' if i < total * 0.4
                 else 'Wave-2' if i < total * 0.8
                 else 'Wave-3')

Key principle: data before design

Never enter Phase 2 without a signed-off wave plan. Every Landing Zone design decision — address space, firewall rules, subnet sizing — is shaped by the workload inventory produced in Phase 1.

Phase 2 — Landing Zone as Code

The Landing Zone is the foundation every migrated workload lands on. It must be deployed before a single VM is replicated, and it must be entirely defined in Terraform — no manual portal steps, no snowflake configuration.

Hub-Spoke Network Topology

The hub VNet hosts Azure Firewall Premium, ExpressRoute/VPN Gateway, Bastion, and the Private DNS Resolver. Spoke VNets are created per workload subscription using for_each, each peered to the hub with gateway transit enabled. All traffic between spokes flows through the hub firewall — no direct spoke-to-spoke paths.

HCL — Terraform modules/networking/hub-spoke/main.tf
# Spoke VNets — one per workload subscription, created with for_each
resource "azurerm_virtual_network" "spoke" {
  for_each            = var.spokes
  name                = "vnet-${each.key}-${var.environment}-${var.location_short}"
  resource_group_name = each.value.resource_group_name
  address_space       = [each.value.address_space]
  dns_servers         = var.custom_dns_servers
  tags                = local.common_tags
}

# Hub → Spoke peering with gateway transit
resource "azurerm_virtual_network_peering" "hub_to_spoke" {
  for_each                  = var.spokes
  name                      = "peer-hub-to-${each.key}"
  virtual_network_name      = azurerm_virtual_network.hub.name
  remote_virtual_network_id = azurerm_virtual_network.spoke[each.key].id
  allow_forwarded_traffic   = true
  allow_gateway_transit     = true
}

# Spoke → Hub peering with remote gateway use
resource "azurerm_virtual_network_peering" "spoke_to_hub" {
  for_each                  = var.spokes
  name                      = "peer-${each.key}-to-hub"
  virtual_network_name      = azurerm_virtual_network.spoke[each.key].name
  remote_virtual_network_id = azurerm_virtual_network.hub.id
  use_remote_gateways       = true
}
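
The "no direct spoke-to-spoke paths" rule is enforced with a default route on every spoke subnet that points at the hub firewall. A minimal sketch, assuming the hub firewall is defined elsewhere in the module as azurerm_firewall.hub; resource and variable names here are illustrative:

HCL hub-spoke module, spoke default route (illustrative)
# Default route per spoke: all egress leaves via the hub firewall's private IP
resource "azurerm_route_table" "spoke_default" {
  for_each            = var.spokes
  name                = "rt-${each.key}-${var.environment}"
  location            = var.location
  resource_group_name = each.value.resource_group_name

  route {
    name                   = "default-via-hub-firewall"
    address_prefix         = "0.0.0.0/0"
    next_hop_type          = "VirtualAppliance"
    next_hop_in_ip_address = azurerm_firewall.hub.ip_configuration[0].private_ip_address   # hub firewall defined elsewhere in the module
  }
}

# Each spoke workload subnet is then associated to its route table with
# azurerm_subnet_route_table_association (association omitted here for brevity).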

Terraform Remote State — First Act

Before any terraform init, the PowerShell backend script creates a GRS storage account with soft-delete, blob versioning, and a CanNotDelete resource lock — preventing accidental state loss for the duration of the programme. The backend configuration block is printed to the terminal ready to paste into backend.tf.
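
The printed block looks roughly like the following; the resource group, storage account, and container names are illustrative:

HCL backend.tf (illustrative)
terraform {
  backend "azurerm" {
    resource_group_name  = "rg-terraform-state-prod"
    storage_account_name = "stterraformstateprod01"
    container_name       = "tfstate"
    key                  = "landing-zone.tfstate"
    use_azuread_auth     = true   # authenticate to state with Azure AD rather than account keys
  }
}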

Phase 3 — The Reusable CI/CD Pipeline

Every Terraform repository in the programme — Landing Zone, pattern library, governance — consumes the same reusable Azure DevOps YAML template stored in pipelines-templates/terraform-cicd.yml. The template enforces six mandatory stages on every PR merge, and every repository pulls it in by reference rather than copying it, as sketched below.
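
A minimal consuming pipeline, assuming the template lives in a pipelines-templates repository within the same Azure DevOps project; the project name, tag, and parameter names here are illustrative:

YAML azure-pipelines.yml in a consuming infra repo (illustrative)
resources:
  repositories:
    - repository: templates
      type: git
      name: PlatformEngineering/pipelines-templates   # assumed project/repo path
      ref: refs/tags/v1.0.0                           # pin the template version being consumed

trigger:
  branches: { include: [main] }

extends:
  template: terraform-cicd.yml@templates
  parameters:
    workingDirectory: terraform        # assumed parameter exposed by the template
    environment: prod

Pinning the template reference to a tag means a change to the shared pipeline rolls out deliberately, repository by repository, rather than landing on every repository at once.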

Never allow terraform apply -auto-approve on PROD

The approval gate exists for a reason. Even if the plan looks identical to DEV, a reviewer's eyes catch destructive changes that automation misses — particularly when resource dependencies shift between migration waves.

Key Toolchain

Terraform + Azure RM Provider

Primary IaC engine. Version pinned in versions.tf. tfenv manages version switching across environments.
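
A representative pin; the exact version constraints are illustrative:

HCL versions.tf (illustrative)
terraform {
  required_version = "~> 1.7"        # illustrative constraint; tfenv keeps engineers on the matching binary

  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100"           # illustrative constraint
    }
  }
}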

tfsec + Checkov

Security misconfiguration and CIS benchmark compliance. Hard-fail gates on HIGH/CRITICAL findings.
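
A representative hard-fail step; the scan path and report location are assumptions, and Checkov runs as an equivalent step alongside it:

YAML tfsec gate step (illustrative)
- script: |
    tfsec ./terraform \
      --minimum-severity HIGH \
      --format junit \
      --out $(Build.ArtifactStagingDirectory)/tfsec-junit.xml
  displayName: 'tfsec: fail on HIGH/CRITICAL findings'

tfsec exits non-zero whenever findings at or above the minimum severity remain, so the stage fails without any extra scripting.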

Azure DevOps Pipelines

Multi-stage YAML pipelines. Single reusable template consumed by all infra repos via resource reference.

GitOps — PR-first Workflow

No direct commits to main. Every infra change is a reviewed PR with Terraform plan output posted as a comment.

Terratest

Go-based integration testing for Terraform modules. Runs post-apply in isolated test subscriptions.

terraform-docs

Auto-generates module README from variable and output declarations. Runs as a CI step on every merge to main.
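
A representative CI step; the module path is an assumption:

YAML terraform-docs step (illustrative)
- script: |
    terraform-docs markdown table \
      --output-file README.md \
      --output-mode inject \
      ./modules/networking/hub-spoke
  displayName: 'terraform-docs: regenerate module README'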

Phase 4 — Migration Patterns & Cutover Orchestration

Every workload from the wave plan is assigned one of six migration patterns before Sprint 9. The pattern determines which Terraform module, runbook, and cutover script applies — there are no ad-hoc decisions on cutover night.

Pattern · Use Case · IaC Approach · Speed
Rehost (Lift & Shift) · Migrate VM as-is via ASR replication · ASR Terraform module + DNS cutover script · Fastest
Replatform · Minor OS or DB optimisation during migration · Terraform + Ansible post-config playbook · Fast
Refactor · Containerise to AKS or move to PaaS · Terraform PaaS modules + Helm charts · Medium
Re-purchase · Replace on-premises app with SaaS (M365, Dynamics) · Azure Marketplace + ARM linked templates · Medium
Retire · Decommission, no migration path exists · terraform destroy pipeline + data archive · N/A
Retain · Regulatory or latency constraint prevents migration · ExpressRoute + Azure Arc hybrid config · On-prem

Cutover Orchestration — The Exact Sequence

The cutover PowerShell script follows a strict sequence for every wave. Pre-flight checks validate replication health and lag (must be under 5 minutes). The application is quiesced. ASR planned failover is triggered per VM with a wait loop. Private DNS records are then cut over to target VM private IPs. Finally, the Python smoke test suite runs — if any test fails three times, the rollback script fires ASR failback automatically.

PowerShell P4-01-wave-cutover.ps1 (excerpt)
# Pre-flight: validate replication health and lag for each VM
$protectedItems = @{}
foreach ($vmName in $VMNames) {
  $item = Get-AzRecoveryServicesAsrReplicationProtectedItem |
           Where-Object { $_.FriendlyName -eq $vmName }
  if ($item.ReplicationHealth -ne "Normal") {
    throw "$vmName replication health: $($item.ReplicationHealth)"
  }
  if ($item.RecoveryPointObjective -gt 300) {
    Write-Warning "$vmName lag $($item.RecoveryPointObjective)s exceeds 5min"
  }
  $protectedItems[$vmName] = $item   # keep the item for the failover stage below
}

# Trigger planned failover per VM (the wait loop on the ASR job follows in the full script)
foreach ($vmName in $VMNames) {
  Start-AzRecoveryServicesAsrPlannedFailoverJob `
    -ReplicationProtectedItem $protectedItems[$vmName] -Direction PrimaryToRecovery | Out-Null
}

# DNS cutover — update Private DNS A record to new target IP
foreach ($vmName in $VMNames) {
  # $TargetResourceGroupName and $DnsZoneResourceGroupName are script parameters (elided above)
  $vm = Get-AzVM -ResourceGroupName $TargetResourceGroupName -Name $vmName
  $privateIp = (Get-AzNetworkInterface -ResourceId $vm.NetworkProfile.NetworkInterfaces[0].Id `
    ).IpConfigurations[0].PrivateIpAddress
  New-AzPrivateDnsRecordSet -ResourceGroupName $DnsZoneResourceGroupName -ZoneName $DnsZoneName `
    -Name $vmName -RecordType A -Ttl 300 `
    -PrivateDnsRecords (New-AzPrivateDnsRecordConfig -IPv4Address $privateIp)
}

# Run smoke test suite — rollback triggered automatically on failure
python smoke_tests/run_wave_smoke_tests.py --vms ($VMNames -join ',')
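
The excerpt runs the suite once; the retry-and-rollback behaviour described above looks roughly like this, where the rollback script name is an assumption:

PowerShell retry-and-rollback wrapper (illustrative)
# Re-run the smoke tests up to three times before declaring the wave failed
$maxAttempts = 3
$passed      = $false
for ($attempt = 1; $attempt -le $maxAttempts -and -not $passed; $attempt++) {
  python smoke_tests/run_wave_smoke_tests.py --vms ($VMNames -join ',')
  $passed = ($LASTEXITCODE -eq 0)
}

if (-not $passed) {
  Write-Warning "Smoke tests failed $maxAttempts times - starting automated rollback"
  # Hypothetical rollback script name; fires ASR failback for every VM in the wave
  & ./P4-02-wave-rollback.ps1 -VMNames $VMNames
}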

Run test failovers 2 weeks before production cutover

Use ASR's test failover into an isolated network. Validate the application fully, then clean up test VMs. Any DNS or connectivity issues found here are far cheaper to fix than on cutover night.
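
A minimal rehearsal sketch, assuming the vault context is already set and that $container, $vmName, and $isolatedVnetId are supplied by the surrounding script:

PowerShell ASR test failover rehearsal (illustrative)
# Look up the protected item for the VM under rehearsal
$item = Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $container |
         Where-Object { $_.FriendlyName -eq $vmName }

# Fail over into the isolated test VNet; production replication keeps running
$job = Start-AzRecoveryServicesAsrTestFailoverJob -ReplicationProtectedItem $item `
         -Direction PrimaryToRecovery -AzureVMNetworkId $isolatedVnetId

while ((Get-AzRecoveryServicesAsrJob -Job $job).State -eq "InProgress") { Start-Sleep -Seconds 30 }

# ... validate the application against the test VM here ...

# Remove the test VMs once validation is complete
Start-AzRecoveryServicesAsrTestFailoverCleanupJob -ReplicationProtectedItem $item `
  -Comment "Wave rehearsal validated" | Out-Null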

Phase 5 — Governance Baked In, Not Bolted On

Governance is deployed on Day 1 of Sprint 3 — before any workload is migrated. Azure Policy is the enforcement layer; Terraform is the delivery mechanism. Every migration wave subscription receives the same mandatory policy initiative at assignment time.

The Mandatory Policy Initiative — What It Enforces

HCL — Terraform modules/governance/policy-initiative/main.tf (excerpt)
resource "azurerm_policy_set_definition" "migration_governance" {
  name                = "migration-governance-initiative"
  policy_type         = "Custom"
  display_name        = "Azure Migration Governance Initiative"
  management_group_id = var.management_group_id
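  # Initiative-level parameters ('allowedLocations', 'logAnalyticsWorkspaceId') are assumed
  # to be declared in a 'parameters' block elided from this excerpt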

  # Allowed locations — Deny effect
  policy_definition_reference {
    policy_definition_id = "/providers/Microsoft.Authorization/policyDefinitions/e56962a6-..."
    parameter_values = jsonencode({
      listOfAllowedLocations = { value = "[parameters('allowedLocations')]" }
    })
  }

  # Diagnostic settings — DeployIfNotExists
  policy_definition_reference {
    policy_definition_id = azurerm_policy_definition.diag_settings.id
    parameter_values = jsonencode({
      workspaceId = { value = "[parameters('logAnalyticsWorkspaceId')]" }
    })
  }
}

# Assign the initiative at Management Group scope
resource "azurerm_management_group_policy_assignment" "migration_governance" {
  name                 = "migration-governance"
  management_group_id  = var.management_group_id
  policy_definition_id = azurerm_policy_set_definition.migration_governance.id
  location             = var.location
  identity { type = "SystemAssigned" }
}

FinOps from Sprint 6

Budget alerts at 80% (warning) and 100% (critical) are deployed via Terraform per subscription. A weekly PowerShell tag compliance report scores every resource group against the mandatory tag taxonomy and emails the result to the cloud governance team — non-compliant resources cannot slip through unnoticed.
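
A minimal sketch of one such budget, using azurerm_consumption_budget_subscription; the variable names and naming convention are assumptions:

HCL subscription budget with 80% / 100% alerts (illustrative)
data "azurerm_subscription" "current" {}

resource "azurerm_consumption_budget_subscription" "wave" {
  name            = "budget-${var.wave_subscription_alias}"   # illustrative naming
  subscription_id = data.azurerm_subscription.current.id
  amount          = var.monthly_budget_amount
  time_grain      = "Monthly"

  time_period {
    start_date = var.budget_start_date   # first of the month the wave lands, e.g. "2025-01-01T00:00:00Z"
  }

  # Warning at 80% of budget
  notification {
    enabled        = true
    threshold      = 80
    operator       = "GreaterThan"
    contact_emails = var.governance_contact_emails
  }

  # Critical at 100% of budget
  notification {
    enabled        = true
    threshold      = 100
    operator       = "GreaterThan"
    contact_emails = var.governance_contact_emails
  }
}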

Phase 6 — Operationalising for the Long Run

A migration that hands over unmonitored, undocumented workloads is not a successful migration. Phase 6 establishes the SRE framework, DR automation pipeline, and knowledge transfer programme that make the client team genuinely self-sufficient.

SLO/SLI Framework — Error Budget Fast-Burn Alert

Azure Monitor scheduled query rules enforce SLOs via KQL. The fast-burn alert fires when the error budget is burning 14× faster than the 30-day allowance — the "page someone immediately" signal, deployed as Terraform for every Tier 1 workload:

KQL — Azure Monitor alert rule: Error Budget Fast-Burn (14× rate), Tier 1 workloads
// Fast-burn: error budget consuming 14× faster than 30d SLO allows
let errorBudgetPct = 1 - 0.9995;   // Tier 1 SLO: 99.95% availability
let burnMultiple   = 14.0;          // 14× = page immediately
AppRequests
| where AppRoleName == 'your-workload-name'
| summarize
    total  = count(),
    failed = countif(Success == false)
  by bin(TimeGenerated, 5m)
| extend errorRate = todouble(failed) / todouble(total)
| where errorRate > errorBudgetPct * burnMultiple
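
The query is wrapped in a scheduled query rule resource per workload. A minimal sketch using azurerm_monitor_scheduled_query_rules_alert_v2; the query file path, variable names, and action group are assumptions:

HCL scheduled query alert wrapping the fast-burn KQL (illustrative)
resource "azurerm_monitor_scheduled_query_rules_alert_v2" "fast_burn" {
  name                 = "slo-fast-burn-${var.workload_name}"
  resource_group_name  = var.monitoring_resource_group_name
  location             = var.location
  scopes               = [var.log_analytics_workspace_id]
  severity             = 1
  evaluation_frequency = "PT5M"
  window_duration      = "PT1H"

  criteria {
    query                   = file("${path.module}/queries/fast-burn.kql")   # the KQL above
    time_aggregation_method = "Count"
    operator                = "GreaterThan"
    threshold               = 0          # any returned row means the budget is burning too fast
  }

  action {
    action_groups = [var.oncall_action_group_id]   # pages the on-call engineer
  }
}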

Automated Quarterly DR Tests

A scheduled Azure DevOps pipeline runs on the 1st of every third month at 02:00 UTC. It triggers an ASR test failover into an isolated network, runs the full smoke test suite against the DR VMs, and then cleans up the test environment automatically — all without human intervention. Results publish to the Azure DevOps test reporting dashboard.

YAML — Azure DevOps pipelines-templates/dr-test-pipeline.yml (schedule)
# Automated DR test — runs quarterly, no human trigger required
schedules:
  - cron: '0 2 1 */3 *'          # 02:00 UTC on 1st of every 3rd month
    displayName: Quarterly DR Test
    branches: { include: [main] }
    always: true

stages:
  - stage: TestFailover
    jobs:
      - deployment: dr_test
        environment: dr-test            # isolated environment in Azure DevOps
        strategy:
          runOnce:
            deploy:
              steps:
                - script: |
                    python dr_tests/run_dr_test.py \
                      --workload '$(WORKLOAD_NAME)' \
                      --vault    '$(VAULT_NAME)' \
                      --network  '$(ISOLATED_VNET_ID)'

  - stage: Cleanup
    condition: always()          # clean up even if tests fail
    jobs:
      - job: cleanup_dr_vms
        steps:
          - pwsh: |                  # Az PowerShell cmdlets need a PowerShell step, not cmd/bash
              # Removes test failover VMs from the isolated network
              Get-AzRecoveryServicesAsrReplicationProtectedItem |
                Where-Object { $_.FailoverRecoveryPointId } |
                ForEach-Object { Start-AzRecoveryServicesAsrTestFailoverCleanupJob -ReplicationProtectedItem $_ }

Knowledge Transfer — The Real Handover Condition

Documentation and workshops are necessary but not sufficient. The programme does not close until a client engineer can independently deploy a new module from the pattern library, run a pipeline to PROD, and respond to a Defender for Cloud alert — all without assistance from the migration team. This is witnessed, signed off, and recorded before the contract ends.

12 KT activities — all must be completed before programme closure

Among them: Terraform & IaC workshops · Azure DevOps pipeline patterns · Azure Policy & Governance lab · per-workload operational runbooks · SLO/SLI monitoring walkthrough · DR tabletop exercise + live rehearsal · FinOps dashboard handover · the witnessed client-independence exercise.

What Makes This Framework Different

Most migration frameworks focus on the destination — what Azure services to use. This one focuses on the delivery system — how to produce consistent, auditable, repeatable outcomes across every wave, regardless of which engineer is running it.

The reusable Terraform module library, the shared CI/CD pipeline template, the policy initiative deployed on Day 1, and the scored wave inventory all serve one goal: make the second wave as reliable as the first, and the tenth wave as reliable as the second.

The scripts and Terraform patterns in this programme are not one-offs. Every module is versioned, README-generated, and consumed by reference — not copy-pasted. Every pipeline run produces an audit trail from commit to deployed infrastructure. Every policy violation is caught before it reaches PROD. That is what enterprise-scale migration engineering looks like.

Access the Full Technical Blueprint

The complete reference document covers all six phases, 22+ scripts and Terraform modules, RACI matrices, full CI/CD YAML templates, the 13-sprint delivery plan, and handover checklists.