Failover WAN caused incoming SIP calls to fail


#1

We’re using a UCM6208, FW version 1.0.19.27. Our primary router (Unifi USG) is set up with a primary WAN and a failover WAN. We’ve had two instances where the primary WAN failed and the failover WAN kicked in. While on the failover WAN, the expected behavior from the UCM6208 would be no SIP traffic, since both our SIP provider and the External Host in the UCM are set to our static IP for the primary WAN.

However, we have an issue where once the primary WAN is restored, our incoming SIP traffic continues trying to go to the failover WAN IP - it’s “stuck” on the failover WAN IP. As a result, incoming SIP calls drop after 32 seconds, which is the magic number for NAT issues. Outgoing calls work flawlessly.

I’ve extensively reviewed our SIP NAT settings, Unifi USG port forwarding, etc. with no luck.

Some important details:

  • External Host in SIP > NAT settings is set as primary WAN static IP
  • SDP setting underneath is disabled (enabling did not resolve the issue)
  • Local Network Address is correctly set up
  • SIP trunks are register trunks, NAT checkbox is disabled in SIP trunk config (enabling caused all SIP traffic to fail)
  • The first time failover WAN kicked in and this issue happened, I ended up upgrading the UCM firmware and the issue resolved. Now I’m on the current FW version. Rebooting the UCM has not helped.

Is there a way to force the UCM to go back to the primary WAN for incoming SIP calls? I’ve searched this forum and have found hints of what to do (“32 seconds” was interesting), but I’m not sure what my next step needs to be. Thanks for your help.


#2

It may have to do with the registration period.

When the UCM registers, VOIP.MS (V) sees the source IP and port of the REGISTER and will use them to respond to (for a bit). I suspect that the UCM registered while the secondary WAN was in play. V saw the registration attempt and sent back the challenge to the IP and port seen from the registration; the UCM responded (again using the secondary WAN), and V accepted it and continued to use that address.

Then, when a call came into V, the call was sent to the IP and port it last saw on the register, which I assume the USG still passed. The UCM answered the INVITE with a 100 Trying and a 180 Ringing, and when a phone answered, the UCM sent a 200 OK with SDP, which told V to use the contact and connection IP from the UCM’s external host setting rather than the IP seen at the register. V did so and sent its ACK for the 200 OK there, but the IP was wrong, so the UCM never saw the ACK. The UCM kept retransmitting the 200 OK, V kept answering with additional ACKs to the wrong address, and after 64*T1 (32 seconds with the default T1 of 500 ms) without an ACK the UCM gave up and dropped the call. That is the 32 second mark.
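For what it's worth, here is a rough sketch of where the 32 seconds comes from. It assumes the RFC 3261 default timers (T1 = 500 ms, T2 = 4 s) and the rule that an unacknowledged 200 OK is retransmitted at a doubling interval, capped at T2, until 64*T1 has elapsed. The function name is mine, and the UCM's actual timer values may differ.

```python
T1 = 0.5  # seconds; RFC 3261 default round-trip estimate (assumed for the UCM)
T2 = 4.0  # seconds; RFC 3261 default cap on the retransmit interval (assumed)


def ok_retransmit_schedule(t1=T1, t2=T2):
    """Return (offsets, deadline) for an unacknowledged 200 OK.

    offsets  - seconds after the first 200 OK at which it is resent
    deadline - 64*T1, the point at which the answering side gives up
    """
    deadline = 64 * t1
    offsets = []
    elapsed = 0.0
    interval = t1
    while elapsed + interval < deadline:
        elapsed += interval
        offsets.append(elapsed)
        # The interval doubles after each retransmission, up to T2.
        interval = min(interval * 2, t2)
    return offsets, deadline


if __name__ == "__main__":
    offsets, deadline = ok_retransmit_schedule()
    print("200 OK resent at:", offsets)
    print("call dropped at:", deadline, "seconds")
```

With the defaults, the deadline works out to 64 * 0.5 = 32 seconds, which matches the drop you are seeing.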

I assume the UCM is in switch mode? If so, then the UCM is not the one at fault. It has no awareness of the WAN, as all traffic is delivered internally to its LAN port behind the firewall. It only knows what is local and what is external (by virtue of the NAT settings, which I assume are correct); this is how it formulates its messages, and any message that is external travels through the gateway set in the network settings. PS: you should have the SDP setting enabled under external host. The SDP covers the audio/video/fax only; the SIP is what sets up and tears down the call.

When all is operating correctly and on the primary WAN, what happens if any SIP traffic hits the secondary WAN IP? Does the USG pass it, reject it, block it?

If you want to get around the issue to some degree, then get a FQDN from a DDNS service such as DYN.com. Use a service that the USG supports as some routers have a set of DDNS providers built-in to the device and these may be the only ones that you can use. Then set up the router for DDNS using that service and in the UCM change the external host from the static IP to the new FQDN. By setting DDNS in the router, the router will know and be the first to advise of any IP/WAN change back to the DDNS provider who will then relay the change to the DNS servers around the globe. The propagation of such will take some time, so there may be some down time until this occurs for whatever DNS V uses, so be patient.
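If you go this route, a quick way to tell whether the FQDN has caught up with the WAN currently in use is to resolve it and compare. This is only a minimal sketch: the hostname and IPs below are made-up placeholders, and it checks whatever DNS your own machine uses, not the DNS that V uses.

```python
import socket


def external_host_matches(fqdn, expected_ip, resolve=socket.gethostbyname):
    """Return True if fqdn currently resolves to expected_ip.

    resolve can be swapped out for testing; by default it is the
    standard library's IPv4 lookup.
    """
    try:
        return resolve(fqdn) == expected_ip
    except OSError:
        # Covers lookup failures (socket.gaierror is a subclass of OSError).
        return False


if __name__ == "__main__":
    # Hypothetical values: replace with your DDNS name and active WAN IP.
    fqdn = "pbx.example.com"
    active_wan_ip = "203.0.113.10"
    if external_host_matches(fqdn, active_wan_ip):
        print("FQDN points at the active WAN")
    else:
        print("FQDN has not caught up yet (or the lookup failed)")
```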

Examine the registration settings in the UCM -

When the UCM sends a REGISTER request to V, it will include an Expires header in the request. This will be the UCM's default. When V receives the request, it will examine the header and determine whether the requested time is acceptable. In the response to the REGISTER, V will also include an expires value: if the request was acceptable it will match the default, and if not, V will substitute its own expiry period, which the UCM is obligated to accept. Either way, the UCM will attempt to refresh the registration when half of the expiry period has elapsed, whatever the negotiated time is.

As you can see in my settings, I have requested an expires time of 120 seconds. The providers I use accept this in their response. The UCM refreshes the registration every 60 seconds as a result, which is also fine with my providers.
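To put numbers on that, here is a small sketch of the arithmetic. It assumes the refresh-at-half behavior described above and that the provider replaces the stored address as soon as the next REGISTER arrives; the function names are just for illustration.

```python
def refresh_interval(expires_s):
    """Seconds between REGISTER refreshes, assuming refresh at half the expiry."""
    return expires_s / 2


def worst_case_stale_seconds(expires_s):
    """Longest time the provider could keep the old WAN address after a switch.

    Assumes the switch happens right after a refresh and that the very next
    REGISTER (one refresh interval later) updates the provider's binding.
    """
    return refresh_interval(expires_s)


if __name__ == "__main__":
    for expires in (120, 3600):
        print(expires, "s expiry ->",
              refresh_interval(expires), "s between refreshes,",
              "up to", worst_case_stale_seconds(expires), "s on a stale address")
```

With a 120 second expiry the provider should learn the new address within about a minute; with a one hour expiry it could take up to 30 minutes.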

You may want to use Wireshark and examine the REGISTER request and the 200 OK response from V to see exactly what period was negotiated, and possibly adjust. Just don’t make the default period too small, as the provider will either override it with their own expires or possibly blacklist your IP for too many requests in too short a period.
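If you copy the 200 OK out of the capture as text, something like the following can pull out the granted period. It is a simplified sketch, not a full SIP parser: the sample message is invented (documentation IPs, made-up account number), it assumes one header per line with no folding, and if there are several Contact headers it simply keeps the last expires it finds. Per RFC 3261, an expires parameter on the Contact takes precedence over the Expires header.

```python
import re


def granted_expires(response_text):
    """Return the registration lifetime (seconds) from a 200 OK, or None."""
    contact_expires = None
    header_expires = None
    for line in response_text.splitlines():
        if not line.strip():
            break  # blank line marks the end of the headers
        name, _, value = line.partition(":")
        name = name.strip().lower()
        if name in ("contact", "m"):  # "m" is the compact form of Contact
            match = re.search(r";\s*expires\s*=\s*(\d+)", value, re.IGNORECASE)
            if match:
                contact_expires = int(match.group(1))
        elif name == "expires" and value.strip().isdigit():
            header_expires = int(value.strip())
    # The Contact parameter wins over the Expires header when both exist.
    return contact_expires if contact_expires is not None else header_expires


# Invented example response; every value here is a placeholder.
SAMPLE_200_OK = (
    "SIP/2.0 200 OK\r\n"
    "Via: SIP/2.0/UDP 192.168.1.10:5060;branch=z9hG4bK776asdhds\r\n"
    "Contact: <sip:100000@203.0.113.10:5060>;expires=120\r\n"
    "Expires: 300\r\n"
    "Content-Length: 0\r\n"
    "\r\n"
)

if __name__ == "__main__":
    print("granted expiry:", granted_expires(SAMPLE_200_OK), "seconds")
```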

If your ISP is subject to frequent short outages and the USG failover often comes into play because of them, then going the DDNS route may not be worth it. A short outage may be over before the new IP has had time to propagate, so the FQDN may never resolve to the WAN IP actually in use (and the change has to propagate again when going back to the primary).

I have this implemented in a couple of places where my clients have the luxury of two different ISPs; most do not.

Good luck.


#3

Thanks so much for the reply @lpneblett! That was very helpful. To get us operational again, we ended up doing a factory reset and restored a backup, and the issue resolved. But I’m going to hang onto this info to try to implement DDNS down the road.