Monday, January 23, 2012

How do hardware interrupts work?

Recently, a friend asked me why we needed device drivers in kernel mode, since the BIOS already has code to deal with some situations (writing to a disk (int 0x13), dealing with VGA (int 0x10)...). While the question seems simple, it got me thinking a bit.
The first answer, which is quite instinctive, is performance. But what is the technical explanation? I'm not an expert, so take what follows with caution, but this is more or less how I understand it.

On boot-up, the processor is in real mode. This means that the interrupt table (in real mode it is called the IVT, the ancestor of the IDT; it maps interrupts to code addresses, a bit like a hash map) sits at a fixed location in low memory, from 0x0000 to 0x03ff, and is filled in by the BIOS. All manufacturers have to write the code that handles these interrupts, and this enables the BIOS to perform some basic tasks (reading the first 512 bytes from a disk to find the MBR, for example).
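As a rough illustration of that lookup, here is a minimal Python sketch. It assumes the classic real-mode layout where each of the 256 vectors takes 4 bytes (a 16-bit offset followed by a 16-bit segment), which is why the table spans 0x0000 to 0x03ff. The function names and the sample segment:offset value are made up for the example.

```python
import struct

VECTOR_SIZE = 4      # 2-byte offset + 2-byte segment, little-endian
VECTOR_COUNT = 256   # 256 * 4 = 1024 bytes, i.e. 0x0000..0x03ff


def vector_entry_address(int_no):
    """Address of the table entry for a given interrupt number."""
    if not 0 <= int_no < VECTOR_COUNT:
        raise ValueError("interrupt number must be in 0..255")
    return int_no * VECTOR_SIZE


def handler_address(memory, int_no):
    """Read the segment:offset pair and convert it to a 20-bit physical address."""
    entry = vector_entry_address(int_no)
    offset, segment = struct.unpack_from("<HH", memory, entry)
    return ((segment << 4) + offset) & 0xFFFFF


# Fake first KiB of memory holding a hypothetical handler for int 0x13
# at F000:1234 (an invented value, not taken from a real BIOS).
memory = bytearray(VECTOR_COUNT * VECTOR_SIZE)
struct.pack_into("<HH", memory, vector_entry_address(0x13), 0x1234, 0xF000)

print(hex(vector_entry_address(0x13)))     # 0x4c
print(hex(handler_address(memory, 0x13)))  # 0xf1234
```

The point is only that the mapping is a plain array indexed by interrupt number, so whoever controls the table controls which code runs.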
So why can't we leverage these interrupts in protected mode? Well, in protected mode, the kernel owns the IDT. It puts the base address of the IDT in the processor's IDTR register (using the LIDT instruction on i386; SIDT reads it back).
This means that it controls the mapping from interrupts to code. This also means that the kernel has a way to optimize and control hardware access. For example, it can buffer accesses to the physical devices to reduce the performance bottleneck. It can also implement file system abstractions (VFS) on top of the basic block devices. But why doesn't it switch back to the int 0x13 BIOS interrupt to access the drive?
Well for a few main reasons I think:
  1. It is simply not practical. In real mode, registers are 16 bits wide and only about 1 MiB of memory is addressable. You cannot map a higher address space to a lower one, so kernel code and data living above that limit would be out of reach.
  2. Even if 1. were possible, imagine the context switches that would be required (they are expensive). When an application in user space wants to write to disk (setting aside buffering mechanisms), it calls fwrite, which in turn triggers a write system call. This triggers a context switch: the arguments on the user-land stack are validated and copied to the kernel stack, and the kernel handles the system call. If it used BIOS interrupts, it would then have to switch to real mode and issue an int 0x13 interrupt, which would be really slow and could only copy a small amount of data at a time. This would be a huge bottleneck in disk access.
That's my 2 cents (and I guess naive 2 cents) on why we can't just switch to real mode to deal with disk or VGA access.

Saturday, December 24, 2011

Mitigating TLS DDoS

A few months ago, THC released PoC code to DoS servers using TLS. The principle is pretty simple and relies on the fact that computing crypto material is done on the server side. If you can force the server to re-compute crypto material over and over, you can overrun it and stop it from answering legitimate requests (and eventually crash it). And yes, TLS does provide a mechanism to do just that: re-negotiation.
What this does is force a re-negotiation/re-generation of the crypto material, which is much more expensive on the server side. And the pain (or beauty, depending on which side you stand) of it is that this is done inside the already existing TLS session. You do not need to open a new TCP session, which would be the right way of doing this. Why? Because that would enable a firewall/device to regulate the number of TCP connections or TLS connections per client in a given time frame... And more generally, the whole standard anti-DDoS arsenal would work (max TCP connections per client, max TCP connections in a given time frame, volume-based thresholds, max TLS connections per client, and anything else you can think of...). But here, re-negotiation happens inside the TLS pipe, so nobody sees anything at the network layer. So mitigation can only be done at the application layer. Nice... Oh and yes, there is currently no workaround, only half-crappy mitigations.

Is your server vulnerable?
Try this to see if it is (I just tried a few https servers on the internet):
openssl s_client -connect developer.mozilla.org:443 -tls1
CONNECTED(00000003)
depth=1 C = US, O = "GeoTrust, Inc.", CN = GeoTrust SSL CA
verify error:num=20:unable to get local issuer certificate
verify return:0
New, TLSv1/SSLv3, Cipher is RC4-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : RC4-SHA
    Session-ID: 757DC6F2F77A04555E3E48EAB41F387392A2024D9D3EF9E2B94C3DAC8CCFF8DB
    Session-ID-ctx:
    Master-Key: 7047D3D1FA0E1BFAE5453400A638AC4FA14D21475F93522D02F4F57C7055F4897973B80049D90BD9B803866CD4BF3ADC
    Key-Arg   : None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1324623593
    Timeout   : 7200 (sec)
    Verify return code: 20 (unable to get local issuer certificate)
---
R
RENEGOTIATING

depth=1 C = US, O = "GeoTrust, Inc.", CN = GeoTrust SSL CA
verify error:num=20:unable to get local issuer certificate
verify return:0
R
RENEGOTIATING

depth=1 C = US, O = "GeoTrust, Inc.", CN = GeoTrust SSL CA
verify error:num=20:unable to get local issuer certificate
verify return:0
R
RENEGOTIATING

depth=1 C = US, O = "GeoTrust, Inc.", CN = GeoTrust SSL CA
verify error:num=20:unable to get local issuer certificate
verify return:0

If renegotiation succeeds (here I manually did it 3 times), then your server is vulnerable.

So what are the mitigations?
  • Application layer
  1. Disable TLS re-negotiation on the server side.
  2. Terminate the TLS connection on a load-balancer and inspect TLS at that level. But maybe the load-balancer will die due to load?
  3. Lower the cipher strength used on the server side.
  • Network layer (aka, we manage network infrastructure and can't change anything on the server side)
This is where it gets nice. Because you can't do anything, no really. All you can try is to mitigate with semi-working methods. But go and tell the guys under a DDoS attack that they need to make changes to their infrastructure and that from a network perspective you can't do anything...
A few ideas:
  1. Set a per-client number of concurrent connections on the front-end firewall. This will help mitigate the issue once TLS re-negotiation is disabled, since the attacker will have to deploy more clients/connections to achieve the same DDoS result. This is standard mitigation for DDoS attacks, and all configs should have that, right? This implies that you manage to convince the server-side team to make some changes.
  2. Try to drop the TLS re-negotiation packet before it reaches the server. The interesting packets are the Encrypted Handshake Message ones. This is the packet which triggers re-negotiation.
So I tried to focus on number 2. Why is this idea interesting? Because we are going to statistically try and identify the packet triggering re-negotiation from the network layer (statistically is a big word for taking a vague guess).
But first, why do classic threshold/counter based approaches fail in our case? That is because of the nature of the TLS DoS. A bit like Slowloris, it is a low bandwidth consumption attack, thus the attacker does not generate more noise than a standard user.


Back to our "statistical" solution, let's call it heuristical to show off a bit. We are going to use some of the properties of the TLS header to try and drop the re-negotiation packet. The TLS header looks like this:
[Figure: TLS header layout]
So we can match on the fields we know should identify the TLS handshake packet, which are Content-Type (=handshake), version (=SSL3.0 -> TLS1.2), 2 bytes of length which we will just ignore, and last but not least the message-types field.

So this is what we are looking for:
  1. tcp port 443 and tcp flags PUSH/ACK
  2. Look into the tcp payload for the following signature:
  • 0x16 (22) which is the start of a handshake protocol in TLS (Content-Type field)
  • 0x0300 (SSL 3.0) or 0x0301 (TLS 1.0) or 0x0302 (TLS 1.1) or 0x0303 (TLS 1.2) (Version field)
  • Ignore the length field (2 bytes)
  • Look for a Message Type which is undefined. The defined types are (0x00, 0x01, 0x02, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x14)
  3. Simply drop the packet

The important part is the last point of 2., because an interesting property is that when re-negotiation is triggered, no valid Message-Type is visible (the handshake message travels encrypted inside the existing session, so that byte looks random). That way we can differentiate legitimate TLS traffic from malicious re-negotiations. The pattern we are looking for would be something like (for TLS 1.0, . being an unknown byte, and x a byte outside the defined types): 160301..x where x should not be one of (0x00, 0x01, 0x02, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x14).

So in a nutshell we are going to apply a regex to the payload of the PSH/ACK TCP packets with a destination of port 443, hoping we can match our regex and drop the packet. Who talked about a performance hog...
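To make the matching rule concrete, here is a minimal Python sketch of the same check applied to one packet. It is only an illustration of the logic, not what the network device runs: the function name and the way the port, flags and payload are passed in are assumptions, and the sample payloads are invented.

```python
# Handshake message types treated as defined, as listed in the post.
DEFINED_HANDSHAKE_TYPES = {0x00, 0x01, 0x02, 0x0b, 0x0c,
                           0x0d, 0x0e, 0x0f, 0x10, 0x14}
# SSL 3.0, TLS 1.0, TLS 1.1, TLS 1.2
TLS_VERSIONS = {b"\x03\x00", b"\x03\x01", b"\x03\x02", b"\x03\x03"}
PSH_ACK = 0x18  # PSH (0x08) + ACK (0x10)


def looks_like_renegotiation(dst_port, tcp_flags, payload):
    """Return True when the packet matches the drop signature."""
    if dst_port != 443 or tcp_flags != PSH_ACK:
        return False
    if len(payload) < 6:
        return False                      # too short to hold the header
    if payload[0] != 0x16:                # Content-Type must be handshake
        return False
    if bytes(payload[1:3]) not in TLS_VERSIONS:
        return False
    # payload[3:5] is the length field, which is ignored
    return payload[5] not in DEFINED_HANDSHAKE_TYPES


# A plain ClientHello (type 0x01) is left alone...
client_hello = b"\x16\x03\x01\x00\x2f\x01\x00\x00\x2b"
# ...while a handshake record whose type byte is not a defined value matches.
encrypted = b"\x16\x03\x01\x00\x20\x9a\x41\x07\xc3"

print(looks_like_renegotiation(443, PSH_ACK, client_hello))  # False
print(looks_like_renegotiation(443, PSH_ACK, encrypted))     # True
```

Since the check keys on a single byte, an encrypted record whose first byte happens to fall on one of the ten defined values would slip through, so this is a heuristic at best.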

Now let's go and apply this to our network device, in my case a good old Guard module. The Guard is already configured with basic settings (rate-limiting, thresholds, conns per client...) but is still letting some DoS packets through because the traffic is not exceeding the threshold.
The way to do this is to create a flex-content-filter on the ADM, which will just hopefully drop the re-negotiation packets. So we need to write a tcpdump regex for this and apply it to our zone. So the nice surprise is that you cannot do a logical "not" in a regexp (or not that I know of). So I had to generate all non-valid Message Type bytes, which gives us this very lengthy filter:
flex-content-filter 1 enabled drop 6 443 expression "tcp[13] = 24" pattern "\x16(\x0300||\x0301||\x0302||\x0303)..?(\x03||\x04||\x05||\x06||\x07||\x08||\x09||\x0a||\x11||\x12||\x13||\x15||\x16||\x17||\x18||\x19||\x1a||\x1b||\x1c||\x1d||\x1e||\x1f||\x20||\x21||\x22||\x23||\x24||\x25||\x26||\x27||\x28||\x29||\x2a||\x2b||\x2c||\x2d||\x2e||\x2f||\x30||\x31||\x32||\x33||\x34||\x35||\x36||\x37||\x38||\x39||\x3a||\x3b||\x3c||\x3d||\x3e||\x3f||\x40||\x41||\x42||\x43||\x44||\x45||\x46||\x47||\x48||\x49||\x4a||\x4b||\x4c||\x4d||\x4e||\x4f||\x50||\x51||\x52||\x53||\x54||\x55||\x56||\x57||\x58||\x59||\x5a||\x5b||\x5c||\x5d||\x5e||\x5f||\x60||\x61||\x62||\x63||\x64||\x65||\x66||\x67||\x68||\x69||\x6a||\x6b||\x6c||\x6d||\x6e||\x6f||\x70||\x71||\x72||\x73||\x74||\x75||\x76||\x77||\x78||\x79||\x7a||\x7b||\x7c||\x7d||\x7e||\x7f||\x80||\x81||\x82||\x83||\x84||\x85||\x86||\x87||\x88||\x89||\x8a||\x8b||\x8c||\x8d||\x8e||\x8f||\x90||\x91||\x92||\x93||\x94||\x95||\x96||\x97||\x98||\x99||\x9a||\x9b||\x9c||\x9d||\x9e||\x9f||\xa0||\xa1||\xa2||\xa3||\xa4||\xa5||\xa6||\xa7||\xa8||\xa9||\xaa||\xab||\xac||\xad||\xae||\xaf||\xb0||\xb1||\xb2||\xb3||\xb4||\xb5||\xb6||\xb7||\xb8||\xb9||\xba||\xbb||\xbc||\xbd||\xbe||\xbf||\xc0||\xc1||\xc2||\xc3||\xc4||\xc5||\xc6||\xc7||\xc8||\xc9||\xca||\xcb||\xcc||\xcd||\xce||\xcf||\xd0||\xd1||\xd2||\xd3||\xd4||\xd5||\xd6||\xd7||\xd8||\xd9||\xda||\xdb||\xdc||\xdd||\xde||\xdf||\xe0||\xe1||\xe2||\xe3||\xe4||\xe5||\xe6||\xe7||\xe8||\xe9||\xea||\xeb||\xec||\xed||\xee||\xef||\xf0||\xf1||\xf2||\xf3||\xf4||\xf5||\xf6||\xf7||\xf8||\xf9||\xfa||\xfb||\xfc||\xfd||\xfe||\xff)"
The important parts are:
  • the tcpdump expression "6 443 expression "tcp[13] = 24"" (look only at tcp (6) PSH/ACK to port 443). Thanks to this great resource for clarifying and explaining in depth the tcpdump filters.
  • the regexp which was discussed earlier
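Nobody wants to type that alternation by hand. Here is a small Python sketch of how the list of non-valid Message Type bytes can be generated. The "||" separator and the overall pattern shape are simply copied from the filter above; the function names are mine.

```python
# Handshake message types treated as defined, as listed in the post.
DEFINED = {0x00, 0x01, 0x02, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f, 0x10, 0x14}


def undefined_type_alternation(separator="||"):
    """Join every byte value that is not a defined type as \\xNN."""
    return separator.join(
        "\\x%02x" % b for b in range(256) if b not in DEFINED
    )


def build_pattern():
    """Assemble content type, version alternatives, length and type byte."""
    versions = "||".join("\\x03%02x" % minor for minor in range(4))
    return "\\x16(%s)..?(%s)" % (versions, undefined_type_alternation())


pattern = build_pattern()
print(pattern[:60])
print(len(undefined_type_alternation().split("||")))  # 246
```

The output should have the same structure as the pattern in the filter above, which makes it easy to regenerate if the list of defined types changes.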
So now, does this work? To be honest, I'm not 100% sure, because the script kiddies (aka Colombian ransom guys) backed off before we were able to test this. I tried the regexp against some traffic we had captured and I got matches for what I was expecting, but that is no real-world scenario. For example, this filter is rendered useless if the attacker uses any form of obfuscation through fragmentation. So I still need to test this in the lab, but I thought it could be worthwhile sharing this.
I expect these attacks to increase in the future since the THC PoC tool is publicly available to anyone.
This post is more or less a rehash of what has been written here, but Vincent goes into much more detail and explains it much better than I do. Go read that for some real technical details.

Friday, December 16, 2011

Terminating overlapping VPN subnets on ASA

A colleague asked me how we could have overlapping VPN networks terminate on an ASA. He was basically looking for VPN in a multi-context ASA, or some kind of VRF-aware ASA. Neither is possible on the ASA: VPN is not supported in multi-context mode, and more generally the ASA is not VRF aware. Another constraint was that the core router behind the ASA also had overlapping subnets, logically separated by VRFs (then going into an MPLS cloud, but we don't care about that part).
My initial idea was to impact customer sites by adding some NATing (Policy NAT) to the remote subnets, so that to the VPN-terminating ASA they would appear as a different translated address. That's the easy way out. But the point was not to impact customer sites, so I had to try something different.
And then I thought that we could encapsulate the overlapping subnets in a GRE tunnel that would pass through the ASA and terminate on the core router. The different GRE tunnel addresses would be used to segregate overlapping traffic.

So to sum up, here is the problem:
  • Multiple customers on the outside (WAN) with overlapping subnets
  • One terminating ASA for VPN traffic
  • Multiple customers on the inside (LAN) with overlapping subnets
How do we get these to play nice? Here is the test setup I came up with:

The idea is to have the ASA as a termination point for VPN traffic, which would then pass through GRE tunnels to the router. The GRE tunnel would forward traffic to the correct VRF. The downside of this is that the ASA cannot do GRE inspection, so the traffic flowing through the ASA will not have any policies applied to it.

ASA config:
hostname ASA
enable password 8Ry2YjIyt7RRXU24 encrypted
names
!
interface Ethernet0/0
 nameif outside
 security-level 0
 ip address 10.1.1.1 255.255.255.0
!
interface Ethernet0/1
 nameif inside
 security-level 100
 ip address 192.168.1.1 255.255.255.0
!
access-list VPN extended permit gre host 172.16.1.1 host 10.1.1.2
access-list VPN2 extended permit gre host 172.16.2.1 host 10.1.1.3
 
!
route inside 172.16.1.0 255.255.255.0 192.168.1.10 1
route inside 172.16.2.0 255.255.255.0 192.168.1.10 1
!
crypto ipsec transform-set 1 esp-aes esp-sha-hmac
crypto map 1 1 match address VPN
crypto map 1 1 set peer 10.1.1.2

crypto map 1 1 set transform-set 1
crypto map 1 2 match address VPN2
crypto map 1 2 set peer 10.1.1.3

crypto map 1 2 set transform-set 1
crypto map 1 interface outside
crypto isakmp enable outside
crypto isakmp policy 1
 authentication pre-share
 encryption aes
 hash sha
 group 2
 lifetime 86400
tunnel-group 10.1.1.2 type ipsec-l2l
tunnel-group 10.1.1.2 ipsec-attributes
 pre-shared-key *
tunnel-group 10.1.1.3 type ipsec-l2l
tunnel-group 10.1.1.3 ipsec-attributes
 pre-shared-key *
 The important point here is to allow GRE return traffic through the firewall. 172.16.1.1 and 172.16.2.1 are the GRE tunnel source addresses for customer 1 and customer 2, which will terminate on the router inside the ASA.
Now the game is more or less done, all we need to do is to put the traffic terminating on our GRE tunnel into the right VRF. Here is the config on the core router (R2 in our case):
Define the VRFs for our customers to allow the overlapping networks on the inside:
vrf definition c1
 rd 65000:1
 !
 address-family ipv4
 exit-address-family
!
vrf definition c2
 rd 65000:2
 !       
 address-family ipv4
 exit-address-family
!
Create some loopback interfaces to simulate a LAN network and to hold the source IPs of our GRE tunnels
interface Loopback0
 ip address 172.16.1.1 255.255.255.0
!
interface Loopback1
 vrf forwarding c1
 ip address 192.168.10.1 255.255.255.0

!
interface Loopback2
 vrf forwarding c2
 ip address 192.168.10.1 255.255.255.0

!
interface Loopback10
 ip address 172.16.2.1 255.255.255.0
!
Create the GRE tunnel interfaces, and forward traffic to the customers VRF:
interface Tunnel0
 vrf forwarding c1
 ip address 1.1.1.2 255.255.255.0
 tunnel source Loopback0
 tunnel destination 10.1.1.2
!
interface Tunnel2
 vrf forwarding c2
 ip address 2.2.2.2 255.255.255.0
 tunnel source Loopback10
 tunnel destination 10.1.1.3
!
Force the traffic from the VRF to the remote overlapping subnets to flow through our GRE tunnel:
ip route 0.0.0.0 0.0.0.0 192.168.1.1
ip route vrf c1 192.168.1.0 255.255.255.0 1.1.1.1
ip route vrf c2 192.168.1.0 255.255.255.0 2.2.2.1
The last thing is to give the GRE parameters to the customer so that he can set up his tunnel correctly. He can send whatever traffic he wants through the tunnel, it will never overlap with other customers. So let's have a look at customer 1 on R1. We provide him with the following parameters:
  • GRE: tunnel destination 172.16.1.1, and the tunnel source must be his WAN address
  • IPSEC: tunnel destination must be 10.1.1.1, plus the crypto information and keys
 R1 config:
crypto isakmp policy 1
 encr aes
 authentication pre-share
 group 2 
crypto isakmp key cisco address 10.1.1.1
!
crypto ipsec transform-set 1 esp-aes esp-sha-hmac
!
crypto map 1 1 ipsec-isakmp
 set peer 10.1.1.1
 set transform-set 1
 match address VPN
!
This is the overlapping subnet for both customers.
interface Loopback0
 ip address 192.168.1.1 255.255.255.0
!
GRE tunnel config: 
interface Tunnel0
 ip address 1.1.1.1 255.255.255.0
 tunnel source FastEthernet1/0
 tunnel destination 172.16.1.1
!
interface FastEthernet1/0
 ip address 10.1.1.2 255.255.255.0
 duplex auto
 speed auto
 crypto map 1
!
ip route 0.0.0.0 0.0.0.0 10.1.1.1
ip route 192.168.10.0 255.255.255.0 1.1.1.2
!        
ip access-list extended VPN
 permit gre host 10.1.1.2 host 172.16.1.1
The config for customer 2 (R3) is very similar (just putting it here for reference). Only the GRE tunnel addressing changes:

crypto isakmp policy 1
 encr aes
 authentication pre-share
 group 2 
crypto isakmp key cisco address 10.1.1.1
!
!
crypto ipsec transform-set 1 esp-aes esp-sha-hmac
!
crypto map 1 1 ipsec-isakmp
 set peer 10.1.1.1
 set transform-set 1
 match address VPN
!
interface Loopback0
 ip address 192.168.1.1 255.255.255.0
!
interface Tunnel0
 ip address 2.2.2.1 255.255.255.0
 tunnel source FastEthernet1/0
 tunnel destination 172.16.2.1
!
interface FastEthernet1/0
 ip address 10.1.1.3 255.255.255.0
 duplex auto
 speed auto
 crypto map 1
!
ip route 0.0.0.0 0.0.0.0 10.1.1.1
ip route 192.168.10.0 255.255.255.0 2.2.2.2
!        
ip access-list extended VPN
 permit gre host 10.1.1.3 host 172.16.2.1
So here we have taken care of overlapping subnets on both ends:
  1. On the WAN, we are hiding the overlapping subnets from the ASA using GRE encapsulation
  2. On the LAN, we use VRFs to keep overlapping traffic apart on the router
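To see why this removes the ambiguity, here is a small Python model of the lookup the core router ends up doing. It is only a sketch: the addresses and VRF names come from the configs above, while the dictionaries and function names are invented for the illustration and have nothing to do with how IOS works internally.

```python
import ipaddress

# GRE tunnel endpoints (router source, customer WAN address) -> VRF,
# following the Tunnel0 and Tunnel2 definitions on R2.
TUNNEL_TO_VRF = {
    ("172.16.1.1", "10.1.1.2"): "c1",
    ("172.16.2.1", "10.1.1.3"): "c2",
}

# One routing table per VRF: the same prefix can exist once in each.
VRF_ROUTES = {
    "c1": {ipaddress.ip_network("192.168.1.0/24"): "1.1.1.1"},
    "c2": {ipaddress.ip_network("192.168.1.0/24"): "2.2.2.1"},
}


def vrf_for_tunnel(local, remote):
    """The GRE endpoints, not the inner addresses, select the VRF."""
    return TUNNEL_TO_VRF[(local, remote)]


def next_hop(vrf, destination):
    """Longest-prefix match inside a single VRF table."""
    addr = ipaddress.ip_address(destination)
    matches = [(net, hop) for net, hop in VRF_ROUTES[vrf].items()
               if addr in net]
    if not matches:
        return None
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(next_hop("c1", "192.168.1.1"))  # 1.1.1.1
print(next_hop("c2", "192.168.1.1"))  # 2.2.2.1
print(vrf_for_tunnel("172.16.2.1", "10.1.1.3"))  # c2
```

The same destination 192.168.1.1 resolves to a different tunnel next hop depending on the VRF, which is exactly what lets both customers keep their addressing.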
Now let's check that traffic goes through both ways for overlapping subnets:
ASA(config)# sh crypto ipsec sa
interface: outside
    Crypto map tag: 1, seq num: 1, local addr: 10.1.1.1

      access-list VPN permit gre host 172.16.1.1 host 10.1.1.2
      local ident (addr/mask/prot/port): (172.16.1.1/255.255.255.255/47/0)
      remote ident (addr/mask/prot/port): (10.1.1.2/255.255.255.255/47/0)
      current_peer: 10.1.1.2

      #pkts encaps: 15, #pkts encrypt: 15, #pkts digest: 15
      #pkts decaps: 15, #pkts decrypt: 15, #pkts verify: 15

      #pkts compressed: 0, #pkts decompressed: 0
      #pkts not compressed: 15, #pkts comp failed: 0, #pkts decomp failed: 0
      #pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0
      #PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0
      #send errors: 0, #recv errors: 0

      local crypto endpt.: 10.1.1.1, remote crypto endpt.: 10.1.1.2

    Crypto map tag: 1, seq num: 2, local addr: 10.1.1.1

      access-list VPN2 permit gre host 172.16.2.1 host 10.1.1.3
      local ident (addr/mask/prot/port): (172.16.2.1/255.255.255.255/47/0)
      remote ident (addr/mask/prot/port): (10.1.1.3/255.255.255.255/47/0)
      current_peer: 10.1.1.3

      #pkts encaps: 15, #pkts encrypt: 15, #pkts digest: 15
      #pkts decaps: 15, #pkts decrypt: 15, #pkts verify: 15

      #pkts compressed: 0, #pkts decompressed: 0
      #pkts not compressed: 15, #pkts comp failed: 0, #pkts decomp failed: 0
      #pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0
      #PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0
      #send errors: 0, #recv errors: 0

      local crypto endpt.: 10.1.1.1, remote crypto endpt.: 10.1.1.3

R2#ping vrf c1 192.168.1.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 28/33/44 ms

ASA(config)# sh crypto ipsec sa
interface: outside
    Crypto map tag: 1, seq num: 1, local addr: 10.1.1.1

      access-list VPN permit gre host 172.16.1.1 host 10.1.1.2
      local ident (addr/mask/prot/port): (172.16.1.1/255.255.255.255/47/0)
      remote ident (addr/mask/prot/port): (10.1.1.2/255.255.255.255/47/0)
      current_peer: 10.1.1.2

      #pkts encaps: 20, #pkts encrypt: 20, #pkts digest: 20
      #pkts decaps: 20, #pkts decrypt: 20, #pkts verify: 20

      #pkts compressed: 0, #pkts decompressed: 0
      #pkts not compressed: 20, #pkts comp failed: 0, #pkts decomp failed: 0
      #pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0
      #PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0
      #send errors: 0, #recv errors: 0

      local crypto endpt.: 10.1.1.1, remote crypto endpt.: 10.1.1.2

    Crypto map tag: 1, seq num: 2, local addr: 10.1.1.1

      access-list VPN2 permit gre host 172.16.2.1 host 10.1.1.3
      local ident (addr/mask/prot/port): (172.16.2.1/255.255.255.255/47/0)
      remote ident (addr/mask/prot/port): (10.1.1.3/255.255.255.255/47/0)
      current_peer: 10.1.1.3

      #pkts encaps: 15, #pkts encrypt: 15, #pkts digest: 15
      #pkts decaps: 15, #pkts decrypt: 15, #pkts verify: 15
      #pkts compressed: 0, #pkts decompressed: 0
      #pkts not compressed: 15, #pkts comp failed: 0, #pkts decomp failed: 0
      #pre-frag successes: 0, #pre-frag failures: 0, #fragments created: 0
      #PMTUs sent: 0, #PMTUs rcvd: 0, #decapsulated frgs needing reassembly: 0
      #send errors: 0, #recv errors: 0

      local crypto endpt.: 10.1.1.1, remote crypto endpt.: 10.1.1.3


R2#ping vrf c2 192.168.1.1
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.1.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 16/24/32 ms
Not showing the ipsec sa here for clarity, but the count will increase on second sa.

R1#ping 192.168.10.1 
Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.10.1, timeout is 2 seconds:
!!!!!
Success rate is 100 percent (5/5), round-trip min/avg/max = 24/32/44 ms

This works, but all of this would be much easier if the router terminated the VPN traffic (using tunnel protection) and put traffic in VRFs that way. The ASA would then just pass ESP traffic to the router. This was not an option in our case, because the core router did not have a hardware crypto card.

Thursday, December 15, 2011

Playing with IPv6 anycast addresses and tunneling

I was playing around with the idea that IPv6 anycast addresses could be useful to replace next-hop redundancy protocols. So I thought I'd give it a try and see how that ended up. My idea was to port known VPN design good practices (router and site redundancy) to IPv6 and weigh the pros and cons.
So I created a small topology with one spoke and several hubs to get up a basic network:
In this case, R6 (left) is the client, and R9 (top) is a standalone hub, whilst R10 and R11 (bottom) are an HA hub pair. After that I went on and configured basic EIGRPv6 for the WAN domain and redistributed all connected interfaces into EIGRPv6. Here is the meaningful config for R5, it is more or less the same for R8 and R6:
interface GigabitEthernet1/0
 no ip address
 negotiation auto
 ipv6 address 2001:DB8::5/64
 ipv6 enable
 ipv6 eigrp 6
!
interface Serial2/0
 no ip address
 ipv6 address 1234::/64 anycast
 ipv6 address 2001:DB5::1/64
 ipv6 enable
 ipv6 eigrp 6
!
ipv6 router eigrp 6
 passive-interface Serial2/0
 eigrp router-id 5.5.5.5
The important part here is the "ipv6 address 1234::/64 anycast" command on the hub-facing interface. What we are doing here using the anycast keyword is disabling Duplicate Address Detection (DAD) on the interface. This means that we are allowing duplicate IPv6 addresses on the Serial2/0 segment. That way we can have duplicate addresses on the LAN segment (for router redundancy?). The same thing can be achieved by doing:
interface Serial2/0
  ipv6 address 1234::/64
  ipv6 nd dad attempts 0
After that the trick is to redistribute your anycast subnet (1234::/64 in our case) into EIGRPv6, by adding "ipv6 eigrp 6" at the interface level. What are we doing here? We are redistributing the anycast subnet into our core network, thus the spoke will learn about the anycast address through EIGRPv6. The point here is that the anycast subnet is learned both via R5 and R8, so R6 (the spoke) can decide based on the EIGRPv6 metrics which is the best path to get to 1234::1:
R6#sh ipv6 eigrp topology
EIGRP-IPv6 Topology Table for AS(6)/ID(6.6.6.6)
Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
       r - reply Status, s - sia Status

P 2001:DB9::/64, 1 successors, FD is 28416
        via FE80::C809:57FF:FED0:1C (28416/28160), GigabitEthernet1/0
P 1234::/64, 1 successors, FD is 28416
        via FE80::C809:57FF:FED0:1C (28416/28160), GigabitEthernet1/0

P 2001:DB8::/64, 1 successors, FD is 2816
        via Connected, GigabitEthernet1/0
P 2001:DB6::/64, 1 successors, FD is 128256
        via Connected, Loopback6
P 2001:DB5::/64, 1 successors, FD is 2170112
        via FE80::C805:57FF:FECE:1C (2170112/2169856), GigabitEthernet1/0
This provides network redundancy, because if one link of the network goes down, all you need to do is wait for the routing protocol to converge, and you will hit another hub (or, if you tune the link metrics, the EIGRPv6 feasible successor (FS) will kick in straight away).
Let's check what this looks like on R6 when both R5 and R8 edge routers are up and running:
R6#traceroute 1234::1
Type escape sequence to abort.
Tracing the route to 1234::1

  1 2001:DB8::8 36 msec 20 msec 16 msec
  2 2001:DB9::10 52 msec 36 msec 36 msec

So we can see that we are going to the R10 hub. We are going there because the network cost is lower on the path through R8 (I artificially lowered the bandwidth on the WAN-facing R5 interface). Now let's shut down the R8 edge router facing interface and see the result:
*Dec 15 01:16:29.961: %DUAL-5-NBRCHANGE: EIGRP-IPv6 6: Neighbor FE80::C809:57FF:FED0:1C (GigabitEthernet1/0) is down: interface down
R6#sh ipv6 eigrp topology
EIGRP-IPv6 Topology Table for AS(6)/ID(6.6.6.6)
Codes: P - Passive, A - Active, U - Update, Q - Query, R - Reply,
       r - reply Status, s - sia Status

P 1234::/64, 1 successors, FD is 2170112
        via FE80::C805:57FF:FECE:1C (2170112/2169856), GigabitEthernet1/0

P 2001:DB8::/64, 1 successors, FD is 2816
        via Connected, GigabitEthernet1/0
P 2001:DB6::/64, 1 successors, FD is 128256
        via Connected, Loopback6
P 2001:DB5::/64, 1 successors, FD is 2170112
        via FE80::C805:57FF:FECE:1C (2170112/2169856), GigabitEthernet1/0

R6#traceroute 1234::1    
Type escape sequence to abort.
Tracing the route to 1234::1

  1 2001:DB8::5 48 msec 12 msec 16 msec
  2 2001:DB5::9 80 msec 12 msec 40 msec
We are taking a different network path, so we have achieved network redundancy in an easy way.
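What happened on R6 can be summed up in a few lines of Python. This is a deliberately simplified sketch: it only picks the lowest advertised metric and ignores everything else EIGRP does (DUAL, feasibility condition, timers). The metric values are the feasible distances shown in the topology tables above; the function and dictionary names are mine.

```python
def best_path(paths):
    """Return the neighbor with the lowest metric, or None if no path is left."""
    if not paths:
        return None
    return min(paths, key=paths.get)


# Metrics towards 1234::/64 as seen from R6, keyed by edge router.
paths_to_anycast = {"R8": 28416, "R5": 2170112}

print(best_path(paths_to_anycast))  # R8

# The R8-facing interface goes down: its path disappears from the table.
del paths_to_anycast["R8"]
print(best_path(paths_to_anycast))  # R5
```

The client never changes its destination address, only the path the routing protocol selects underneath it.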
Now what about redundancy at the router level (between R10 and R11)? This works the same way exactly. Let me "no shut" the WAN facing interface of R8, to bring the primary link back up:
*Dec 15 01:21:05.325: %DUAL-5-NBRCHANGE: EIGRP-IPv6 6: Neighbor FE80::C809:57FF:FED0:1C (GigabitEthernet1/0) is up: new adjacency
 R6# traceroute 1234::1
Type escape sequence to abort.
Tracing the route to 1234::1

  1 2001:DB8::8 216 msec 100 msec 100 msec
  2 2001:DB9::10 144 msec 84 msec 24 msec
and "shut" the interface of R10 to simulate a failure of the hub. I will keep a ping running from the client to see how long the down time is:
R6#ping 1234::1 repeat 1000
Type escape sequence to abort.
Sending 1000, 100-byte ICMP Echos to 1234::1, timeout is 2 seconds:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!...!!!!!!!!!!!!!!!!!!!!!!!!!!
R6#traceroute 1234::1     
Type escape sequence to abort.
Tracing the route to 1234::1

  1 2001:DB8::8 72 msec 104 msec 96 msec
  2 2001:DB9::11 512 msec 8 msec 28 msec
We can see that the redundancy at the router level (between R10 and R11) is achieved within 3 pings. 

I think that IPv6 anycast addresses have real potential to achieve HA for stateless protocols and at the network layer. I do not think, as I thought initially, that they are a good replacement for router-level redundancy protocols such as HSRP or VRRP. Indeed these have the advantage of "converging" faster, and of keeping track of their state on a separate channel. They can also help achieve stateful redundancy for crypto.

There are still a few grey spots that I am unsure of:
  1. How does the router choose the anycast address to use when several live on the same lan segment? For example for R10 and R11, how does the router choose to which one to speak? It would be necessary to be able to influence the choice of the primary and secondary hubs.
  2. Can we use anycast to achieve load-balancing by making sure the metrics of the links are the same? I think this is a bad idea, but...
The next step here is to add redundancy at the tunnel level. I used plain GRE tunnels, but the principle would be the same for IPSEC. On R6, use the anycast address as the GRE destination of your tunnel. That way if one tunnel fails, you will be able to fail over to the other hub:
R6#sh run int tun0
Building configuration...

Current configuration : 178 bytes
!
interface Tunnel0
 no ip address
 ipv6 address 2001::6/64
 ipv6 enable
 keepalive 10 3
 tunnel source GigabitEthernet1/0
 tunnel mode gre ipv6
 tunnel destination 1234::1
end
I think DMVPN would integrate perfectly in this setup, but Cisco doesn't yet support DMVPN for NBMA IPv6, so the story cannot continue now...