How to Resolve Azure Site Recovery Replication and Failover Errors

Understanding Azure Site Recovery

Azure Site Recovery (ASR) provides disaster recovery by replicating VMs to a secondary Azure region. Replication and failover failures leave your DR strategy broken, potentially leaving you unprotected during an actual disaster. This guide covers the most common ASR errors and their resolutions.

Understanding the Root Cause

Resolving Azure Site Recovery replication and failover errors requires more than applying a quick fix to suppress error messages. The underlying cause typically involves a mismatch between your application’s expectations and the service’s actual behavior or limits. Azure services enforce quotas, rate limits, and configuration constraints that are documented but often overlooked during initial development, when traffic volumes are low and edge cases are rare.

When this issue appears in production, it usually indicates that the system has crossed a threshold that was not accounted for during capacity planning. This could be a throughput limit, a connection pool ceiling, a timeout boundary, or a resource quota. The error messages from Azure services are designed to be actionable, but they sometimes point to symptoms rather than the root cause. For example, a timeout error might actually be caused by a DNS resolution delay, a TLS handshake failure, or a downstream dependency that is itself throttled.

The resolution strategies in this guide are organized from least invasive to most invasive. Start with configuration adjustments that do not require code changes or redeployment. If those are insufficient, proceed to application-level changes such as retry policies, connection management, and request patterns. Only escalate to architectural changes like partitioning, sharding, or service tier upgrades when the simpler approaches cannot meet your requirements.

Impact Assessment

Before implementing any resolution, assess the blast radius of the current issue. Determine how many users, transactions, or dependent services are affected. Check whether the issue is intermittent or persistent, as this distinction changes the urgency and approach. Intermittent issues often indicate resource contention or throttling near a limit, while persistent failures typically point to misconfiguration or a hard limit being exceeded.

Review your Service Level Objectives (SLOs) to understand the business impact. If your composite SLA depends on this service’s availability, calculate the actual downtime or degradation window. This information is critical for incident prioritization and for justifying the engineering investment required for a permanent fix versus a temporary workaround.
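For example, when two services sit in series on the critical path, their composite availability is the product of their individual SLAs. This is easy to quantify (the numbers below are illustrative, not from any specific SLA):

```shell
# Composite availability for serial dependencies is the product of the
# individual SLAs: two 99.9% services compose to roughly 99.8%.
sla=$(awk 'BEGIN { printf "%.6f", 0.999 * 0.999 }')
# Expected unavailability, converted to minutes per year
downtime=$(awk -v s="$sla" 'BEGIN { printf "%.0f", (1 - s) * 365 * 24 * 60 }')
echo "composite availability: $sla (~$downtime minutes of downtime/year)"
```

Roughly a thousand minutes of expected annual downtime is a very different conversation with stakeholders than "three nines," which is why the degradation window is worth computing before choosing between a workaround and a permanent fix.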

Consider the cascading effects on protected workloads. When Azure Site Recovery replication degrades, every VM that depends on it loses its recovery point guarantees for the duration of the outage. Map out which workloads are protected to understand the full impact scope and prioritize the resolution accordingly.

Common Error Codes

Error Code | Description | Cause
-----------|-------------|------
150097 | Unable to install Mobility Service | Firewall blocking, missing prerequisites
151066 | ASR agent communication failure | Network connectivity issues
151037 | Failed to register with ASR provider | Proxy or DNS configuration
151196 | Replication could not be enabled | Storage account or cache issues
151197 | Virtual network mapping failed | Target VNet not found or misconfigured
150172 | Failed to enable replication | Disk configuration not supported
151025 | Agent connectivity check failed | NSG rules blocking outbound traffic

Error 150097: Mobility Service Installation Failed

# Verify prerequisites on the source VM (Linux)
# Check OS version is supported
cat /etc/os-release

# Check free disk space; the Mobility Service installer fails without
# sufficient free space on the root volume
df -h /

# Ensure required ports are open
netstat -tlnp | grep -E '443|9443'

# Install prerequisites for Linux
sudo apt-get install -y linux-headers-$(uname -r) gcc make perl

# Check if Mobility Service is installed
systemctl status svagents
systemctl status InMage*

# Windows VM prerequisites check (PowerShell)
# Check .NET Framework version (4.6+ required)
Get-ItemPropertyValue -Path 'HKLM:\SOFTWARE\Microsoft\NET Framework Setup\NDP\v4\Full' -Name Release

# Enable File and Printer Sharing
netsh advfirewall firewall set rule group="File and Printer Sharing" new enable=Yes

# Enable WMI
netsh advfirewall firewall set rule group="Windows Management Instrumentation (WMI)" new enable=Yes

Error 151025: NSG Network Rules

# Required service tags for outbound NSG rules.
# An NSG rule accepts only one service tag, so create one rule per tag.
priority=100
for tag in Storage AzureSiteRecovery AzureActiveDirectory EventHub; do
  az network nsg rule create \
    --resource-group myRG \
    --nsg-name myNSG \
    --name "Allow-${tag}" \
    --priority "$priority" \
    --direction Outbound \
    --access Allow \
    --protocol Tcp \
    --destination-port-ranges 443 \
    --destination-address-prefixes "$tag"
  priority=$((priority + 1))
done

# Verify outbound connectivity from the VM
# To Azure Storage (replace cache1 with your cache storage account name)
curl -v https://cache1.blob.core.windows.net 2>&1 | head -5

# To Site Recovery (replace myVault with your vault's endpoint name)
curl -v https://myVault.siterecovery.windowsazure.com 2>&1 | head -5

Error 151037: Proxy Configuration

# Linux: configure proxy for the Mobility Service
# Append the proxy settings to /etc/environment (do not overwrite the file)
sudo tee -a /etc/environment > /dev/null << 'EOF'
http_proxy=http://proxy.example.com:8080
https_proxy=http://proxy.example.com:8080
no_proxy=169.254.169.254,168.63.129.16
EOF

# Update the Mobility Service proxy settings by editing
# /var/lib/svagents/InMage/etc/drscout.conf in place; overwriting the
# whole file discards other required settings. Set these keys:
#   ProxyMode=0
#   ProxyHost=proxy.example.com
#   ProxyPort=8080

# Restart the service
systemctl restart svagents

# Windows: configure proxy for the Mobility Service (PowerShell)
$RegPath = "HKLM:\SOFTWARE\Microsoft\Azure Site Recovery\ProxySettings"
# Create the registry key if it does not already exist, then set the proxy
New-Item -Path $RegPath -Force | Out-Null
New-ItemProperty -Path $RegPath -Name "ProxyAddress" -Value "http://proxy.example.com:8080" -PropertyType String -Force
Restart-Service InMage*

Enabling Replication

# Create a Recovery Services vault
az backup vault create \
  --name myRecoveryVault \
  --resource-group myRG \
  --location eastus2

# Enable replication for a VM
az site-recovery replication-protected-item create \
  --fabric-name "azure-eastus" \
  --protection-container "asr-a2a-default-eastus-container" \
  --resource-group myRG \
  --vault-name myRecoveryVault \
  --name myVM-replication \
  --policy-id "/subscriptions/{subId}/resourceGroups/myRG/providers/Microsoft.RecoveryServices/vaults/myRecoveryVault/replicationPolicies/a2a-policy" \
  --provider-specific-details '{
    "instanceType": "A2A",
    "fabricObjectId": "/subscriptions/{subId}/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM",
    "recoveryContainerId": "/subscriptions/{subId}/resourceGroups/myRG-dr/providers/Microsoft.RecoveryServices/vaults/myRecoveryVault/replicationFabrics/azure-eastus2/replicationProtectionContainers/asr-a2a-default-eastus2-container",
    "recoveryResourceGroupId": "/subscriptions/{subId}/resourceGroups/myRG-dr"
  }'

Resilience Patterns for Long-Term Prevention

Once you resolve the immediate issue, invest in resilience patterns that prevent recurrence. Azure's cloud-native services provide building blocks for resilient architectures, but you must deliberately design your application to use them effectively.

Retry with Exponential Backoff: Transient failures are expected in distributed systems. Your application should automatically retry failed operations with increasing delays between attempts. The Azure SDK client libraries implement retry policies by default, but you may need to tune the parameters for your specific workload. Set maximum retry counts to prevent infinite retry loops, and implement jitter (randomized delay) to prevent thundering herd problems when many clients retry simultaneously.
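As an illustration of backoff with jitter, the sketch below retries a flaky operation; `unreliable_call` is a hypothetical stand-in for any Azure CLI or REST call, not an SDK API:

```shell
# Hypothetical stand-in for a flaky Azure call: fails twice, then succeeds
ATTEMPTS_FILE=$(mktemp)
echo 0 > "$ATTEMPTS_FILE"
unreliable_call() {
  n=$(cat "$ATTEMPTS_FILE")
  n=$((n + 1))
  echo "$n" > "$ATTEMPTS_FILE"
  [ "$n" -ge 3 ]   # succeed on the third attempt
}

max_retries=5
attempt=1
until unreliable_call; do
  if [ "$attempt" -ge "$max_retries" ]; then
    echo "giving up after $attempt attempts" >&2
    exit 1
  fi
  # Exponential backoff (1s, 2s, 4s, ...) plus 0-2s of random jitter
  delay=$(( (1 << (attempt - 1)) + $(od -An -N1 -tu1 /dev/urandom) % 3 ))
  echo "attempt $attempt failed; retrying in ${delay}s"
  sleep "$delay"
  attempt=$((attempt + 1))
done
echo "succeeded on attempt $attempt"
```

The bounded retry count is what prevents an infinite loop, and the jitter term spreads out retries from many clients so they do not hammer the service in lockstep.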

Circuit Breaker Pattern: When a dependency consistently fails, continuing to send requests increases load on an already stressed service and delays recovery. Implement circuit breakers that stop forwarding requests after a configurable failure threshold, wait for a cooldown period, then tentatively send a single test request. If the test succeeds, the circuit closes and normal traffic resumes. If it fails, the circuit remains open. Azure API Management provides a built-in circuit breaker policy for backend services.
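A minimal sketch of the closed-to-open transition follows; `backend_call` is an illustrative stand-in, and the cooldown timer plus half-open probe described above are omitted for brevity:

```shell
# Minimal circuit-breaker sketch; names are illustrative, not an Azure API.
# States: closed (forward requests) and open (fail fast). A production
# breaker also needs a cooldown period and a half-open probe request.
FAILURE_THRESHOLD=3
state="closed"
failures=0

backend_call() {                 # stand-in for the protected dependency
  [ "${BACKEND_UP:-0}" -eq 1 ]
}

guarded_call() {
  if [ "$state" = "open" ]; then
    echo "circuit open: failing fast"
    return 1
  fi
  if backend_call; then
    failures=0
    echo "ok"
  else
    failures=$((failures + 1))
    if [ "$failures" -ge "$FAILURE_THRESHOLD" ]; then
      state="open"
      echo "threshold reached: opening circuit"
    else
      echo "failure $failures of $FAILURE_THRESHOLD"
    fi
    return 1
  fi
}

# Simulate a dead backend: three failures open the circuit,
# and the fourth call fails fast without touching the backend
BACKEND_UP=0
guarded_call; guarded_call; guarded_call; guarded_call || true
```

The key property is the fourth call: once the circuit is open, the stressed dependency receives no traffic at all, giving it room to recover.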

Bulkhead Isolation: Separate critical and non-critical workloads into different resource instances, connection pools, or service tiers. If a batch processing job triggers throttling or resource exhaustion, it should not impact the real-time API serving interactive users. Use separate Azure resource instances for workloads with different priority levels and different failure tolerance thresholds.

Queue-Based Load Leveling: When the incoming request rate exceeds what the backend can handle, use a message queue (Azure Service Bus or Azure Queue Storage) to absorb the burst. Workers process messages from the queue at the backend's sustainable rate. This pattern is particularly effective for resolving throughput-related issues because it decouples the rate at which requests arrive from the rate at which they are processed.
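A toy sketch of the pattern, using a local spool directory as a stand-in for Azure Service Bus or Queue Storage (all names here are illustrative):

```shell
# Queue-based load leveling sketch: producers write as fast as they like,
# a worker drains at its own sustainable rate.
QUEUE_DIR=$(mktemp -d)

enqueue() { echo "$2" > "$QUEUE_DIR/$1.msg"; }

# A burst of producers enqueues faster than the backend could serve
for i in 1 2 3 4 5; do enqueue "$i" "job-$i"; done

# The worker processes one message at a time from the queue
processed=0
for msg in "$QUEUE_DIR"/*.msg; do
  cat "$msg" > /dev/null      # "process" one message
  rm "$msg"                   # remove it only after successful processing
  processed=$((processed + 1))
done
echo "processed $processed messages"
```

Deleting a message only after it is processed mirrors the peek-lock/complete semantics that real message brokers provide, so a crashed worker does not lose work.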

Cache-Aside Pattern: For read-heavy workloads, cache frequently accessed data using Azure Cache for Redis to reduce the load on the primary data store. This is especially effective when the resolution involves reducing request rates to a service with strict throughput limits. Even a short cache TTL of 30 to 60 seconds can dramatically reduce the number of requests that reach the backend during traffic spikes.
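A minimal cache-aside sketch, with a temp directory standing in for Azure Cache for Redis and a hypothetical `get` helper:

```shell
# Cache-aside sketch: check the cache before the primary store,
# and populate the cache on a miss.
CACHE_DIR=$(mktemp -d)
backend_reads=0

get() {
  key=$1
  if [ ! -f "$CACHE_DIR/$key" ]; then
    # Cache miss: fetch from the primary store and populate the cache
    backend_reads=$((backend_reads + 1))
    echo "value-for-$key" > "$CACHE_DIR/$key"
  fi
  cat "$CACHE_DIR/$key"
}

get user42 > /dev/null   # miss: one backend read
get user42 > /dev/null   # hit: served from the cache
echo "backend reads: $backend_reads"
```

In a real deployment each cached entry would also carry a TTL, as described above, so stale data ages out without explicit invalidation.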

Error 151196: Cache Storage Account Issues

# ASR uses a cache storage account in the source region
# Check if the storage account exists and is accessible
az storage account show \
  --name asrcache123 \
  --resource-group myRG \
  --query "{status:statusOfPrimary, kind:kind, sku:sku.name}" -o json

# The cache storage account must:
# - Be in the SAME region as the source VM
# - Be General Purpose v1 or v2
# - NOT have firewall rules blocking ASR service
# - NOT use private endpoints (or must allow ASR service)

# Check storage account firewall
az storage account show \
  --name asrcache123 \
  --resource-group myRG \
  --query "networkRuleSet" -o json

# Allow ASR service access
az storage account update \
  --name asrcache123 \
  --resource-group myRG \
  --bypass "AzureServices"

Test Failover

# Run a test failover (does not affect production)
az site-recovery replication-protected-item test-failover \
  --fabric-name "azure-eastus" \
  --protection-container "asr-a2a-default-eastus-container" \
  --resource-group myRG \
  --vault-name myRecoveryVault \
  --name myVM-replication \
  --failover-direction PrimaryToRecovery \
  --network-id "/subscriptions/{subId}/resourceGroups/myRG-dr/providers/Microsoft.Network/virtualNetworks/dr-vnet"

# Check test failover status
az site-recovery replication-protected-item show \
  --fabric-name "azure-eastus" \
  --protection-container "asr-a2a-default-eastus-container" \
  --resource-group myRG \
  --vault-name myRecoveryVault \
  --name myVM-replication \
  --query "properties.testFailoverState" -o tsv

# Cleanup test failover
az site-recovery replication-protected-item test-failover-cleanup \
  --fabric-name "azure-eastus" \
  --protection-container "asr-a2a-default-eastus-container" \
  --resource-group myRG \
  --vault-name myRecoveryVault \
  --name myVM-replication

Replication Health Monitoring

# Check replication health
az site-recovery replication-protected-item list \
  --fabric-name "azure-eastus" \
  --protection-container "asr-a2a-default-eastus-container" \
  --resource-group myRG \
  --vault-name myRecoveryVault \
  --query "[].{name:name, health:properties.replicationHealth, state:properties.protectionState, rpo:properties.lastRpoCalculatedTime}" \
  -o table

Understanding Azure Service Limits and Quotas

Every Azure service operates within defined limits and quotas that govern the maximum throughput, connection count, request rate, and resource capacity available to your subscription. These limits exist to protect the multi-tenant platform from noisy-neighbor effects and to ensure fair resource allocation across all customers. When your workload approaches or exceeds these limits, the service enforces them through throttling (HTTP 429 responses), request rejection, or degraded performance.

Azure service limits fall into two categories: soft limits that can be increased through a support request, and hard limits that represent fundamental architectural constraints of the service. Before designing your architecture, review the published limits for every Azure service in your solution. Plan for the worst case: what happens when you hit the limit during a traffic spike? Your application should handle throttled responses gracefully rather than failing catastrophically.

Use Azure Monitor to track your current utilization as a percentage of your quota limits. Create dashboards that show utilization trends over time and set alerts at 70 percent and 90 percent of your limits. When you approach a soft limit, submit a quota increase request proactively rather than waiting for a production incident. Microsoft typically processes quota increase requests within a few business days, but during high-demand periods it may take longer.
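One way to wire up such an alert is with the Azure CLI; the resource IDs below are placeholders, and `Percentage CPU` is used purely as an example metric, so substitute whichever metric tracks your limiting dimension:

```shell
# Illustrative metric alert at 90 percent utilization
az monitor metrics alert create \
  --name vm-cpu-90pct \
  --resource-group myRG \
  --scopes "/subscriptions/{subId}/resourceGroups/myRG/providers/Microsoft.Compute/virtualMachines/myVM" \
  --condition "avg Percentage CPU > 90" \
  --description "Warn when utilization crosses 90 percent of capacity"
```

Pairing a 70 percent warning alert with a 90 percent critical alert gives you time to file a quota increase before the limit is hit.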

For services that support multiple tiers or SKUs, evaluate whether upgrading to a higher tier provides the headroom you need. Compare the cost of the upgrade against the cost of engineering effort to work around the current limits. Sometimes, paying for a higher service tier is more cost-effective than building complex application-level sharding, caching, or load-balancing logic to stay within the lower tier's constraints.

Disaster Recovery and Business Continuity

When resolving service issues, consider the broader disaster recovery and business continuity implications. Because Azure Site Recovery is itself the backbone of your DR strategy, your Recovery Time Objective (RTO) and Recovery Point Objective (RPO) determine how quickly replication must be restored and how much data loss is acceptable while it is degraded.

Implement a multi-region deployment strategy for business-critical services. Azure paired regions provide automatic data replication and prioritized recovery during regional outages. Configure your application to failover to the secondary region when the primary region is unavailable. Test your failover procedures regularly to ensure they work correctly and meet your RTO targets.

Maintain infrastructure-as-code templates for all your Azure resources so you can redeploy your entire environment in a new region if necessary. Store these templates in a geographically redundant source code repository. Document the manual steps required to complete a region failover, including DNS changes, connection string updates, and data synchronization verification.

Recovery Plan

# Create a recovery plan for coordinated failover of multiple VMs
az site-recovery recovery-plan create \
  --resource-group myRG \
  --vault-name myRecoveryVault \
  --name myRecoveryPlan \
  --primary-fabric-id "/subscriptions/{subId}/resourceGroups/myRG/providers/Microsoft.RecoveryServices/vaults/myRecoveryVault/replicationFabrics/azure-eastus" \
  --recovery-fabric-id "/subscriptions/{subId}/resourceGroups/myRG/providers/Microsoft.RecoveryServices/vaults/myRecoveryVault/replicationFabrics/azure-eastus2" \
  --groups '[{"groupType":"Boot","replicationProtectedItems":[{"id":"item1-id"},{"id":"item2-id"}]}]'

Disk Configuration Errors (150172)

  • Ultra disks — Not supported for ASR replication
  • Shared disks — Not supported
  • Ephemeral OS disks — Not supported
  • Disk size > 32 TB — Not supported
  • Disk churn > 54 MB/s per disk — May cause replication lag; use Premium managed disks
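Before enabling replication, you can inspect a VM's disk configuration for these unsupported features; the resource names below are examples:

```shell
# Surface the disk properties that matter for ASR support:
# Ultra disks appear as UltraSSD_LRS, and ephemeral OS disks
# have diffDiskSettings populated
az vm show \
  --resource-group myRG \
  --name myVM \
  --query "{osDiskSku:storageProfile.osDisk.managedDisk.storageAccountType, ephemeralSettings:storageProfile.osDisk.diffDiskSettings, dataDisks:storageProfile.dataDisks[].{lun:lun, sizeGb:diskSizeGb, sku:managedDisk.storageAccountType}}" \
  -o json
```

Running this check across a resource group before onboarding saves a round of 150172 failures later.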

Capacity Planning and Forecasting

The most effective resolution is preventing the issue from recurring through proactive capacity planning. Establish a regular review cadence where you analyze growth trends in your service utilization metrics and project when you will approach limits.

Use Azure Monitor metrics to track the key capacity indicators for Azure Site Recovery, such as RPO trends and per-disk churn, over time. Create a capacity planning workbook that shows current utilization as a percentage of your provisioned limits, the growth rate over the past 30, 60, and 90 days, and projected dates when you will reach 80 percent and 100 percent of capacity. Share this workbook with your engineering leadership to support proactive scaling decisions.

Factor in planned events that will drive usage spikes. Product launches, marketing campaigns, seasonal traffic patterns, and batch processing schedules all create predictable demand increases that should be accounted for in your capacity plan. If your application serves a global audience, consider time-zone-based traffic distribution and scale accordingly.

Implement autoscaling where the service supports it. Azure autoscale rules can automatically adjust capacity based on real-time metrics. Configure scale-out rules that trigger before you reach limits (at 70 percent utilization) and scale-in rules that safely reduce capacity during low-traffic periods to optimize costs. Test your autoscale rules under load to verify that they respond quickly enough to protect against sudden traffic spikes.

Summary

ASR failures primarily stem from network connectivity (NSG rules missing required service tags for Storage, AzureSiteRecovery, and AAD), Mobility Service installation prerequisites (ports, disk space, OS compatibility), cache storage account configuration (must be same region, allow Azure services), and unsupported disk types. Always run a test failover after setting up replication, monitor replication health for RPO drift, and ensure outbound connectivity from source VMs to Azure service endpoints on port 443.

For more details, refer to the official documentation: About Site Recovery, Troubleshoot Azure-to-Azure VM replication errors.
