Archive for October, 2011

Good CoPP, Bad CoPP – Balanced Policing

Right – this might be a bit long. I haven’t yet worked out how to write a ‘short but succinct’ blog post.

I had one of those ‘tada’ or ‘eureka’ or ‘bloody hell, why didn’t that occur to me earlier’ moments this afternoon (it was the latter).  You know what it’s like: you’re at the end of your tether trying to get something to work, you’ve been fumbling around for hours, and out of the corner of your brain a little flicker resembling a thought process occurs.  You give it a moment to surface, chew on it, and then it’s ‘oh hell, I’m an idiot’ as you prove you’ve fixed your own mess of a problem. This happened while staring at a multicast convergence problem today – and it was all due to bad CoPP.

CoPP – or Control Plane Policing – is regarded by Cisco as a security feature/mechanism. It’s designed to protect the switch’s CPU from being overwhelmed by control-plane traffic (whether that traffic is legitimate, accidental or the likes of a DoS attack).  The Catalyst 6500 had it – but no-one ever seemed to configure it. On the Nexus 7000 and all new NX-OS based switches, it’s part of the default configuration (unless you’re a monkey and choose ‘none’ during the startup script).

CoPP is configured using MQC (the Modular QoS CLI) and allows you to define classes of traffic that might head to the CPU and apply a policing policy to each.  Usually, the default policies (lenient, moderate or strict) are fine for most network deployments, say 80% of them.  For the other 20% you have to tweak the policy a little.

Case 1 Bad CoPP – A common [and in my opinion, stupid] method for a server to detect the loss of its gateway is to ARP for it. In the old days, this wasn’t a problem – the 6500 would just churn through the ARPs and spit out responses, perhaps missing one or two. When a customer decided to test the Nexus 7000, they found that their servers kept seeing gateway losses.  Turns out, the ARP rate allowed by the default policy was being exceeded by the sheer number of servers sending out ARPs. The customer moaned. Of course they would; they think the switch is bad and broken and.. anyway. So to work around the [stupid, stupid] idea, CoPP is tuned to allow more ARP up to the CPU.  It’s the easy fix to implement, rather than fixing the real issue, which is that your servers don’t need to ARP for the gateway when you have HSRP (but hey, who am I to argue with the customer?).

Case 2 Good CoPP – A customer has a setup where they need a fast multicast-convergence time but are also receiving the same (S,G) streams on two different interfaces.  Fast multicast-convergence means we need to register the multicast frames with the RP as quickly as possible, so the RP can learn and then (as we happen to be using PIM Anycast-RP) relay the PIM registers to the other RPs.  For this, we can raise the policing rate for PIM protocol messages (the default was 200pps, so we upped it to 600pps).  This is fine; we’re just allowing the policy to scale upwards.
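To put the 200pps versus 600pps change in concrete terms, here is a minimal token-bucket sketch in Python. It is only a model of how a packets-per-second policer behaves, not how NX-OS implements CoPP in hardware. The 20-packet burst size and the 500-registers-in-one-second load are assumptions picked for illustration.

```python
def police(arrivals_ms, rate_pps, burst_packets):
    """Token-bucket policer using integer maths.

    arrivals_ms: packet arrival times in milliseconds, ascending.
    Tokens are counted in thousandths of a packet, so one millisecond
    at rate_pps adds exactly rate_pps units and a packet costs 1000.
    Returns the number of packets that conform (are allowed through).
    """
    capacity = burst_packets * 1000
    tokens = capacity          # bucket starts full
    last_ms = 0
    passed = 0
    for t in arrivals_ms:
        tokens = min(capacity, tokens + (t - last_ms) * rate_pps)
        last_ms = t
        if tokens >= 1000:
            tokens -= 1000
            passed += 1
    return passed


# Assumed load: 500 PIM register packets, one every 2 ms, within one second.
arrivals = list(range(0, 1000, 2))

passed_at_200 = police(arrivals, rate_pps=200, burst_packets=20)
passed_at_600 = police(arrivals, rate_pps=600, burst_packets=20)

print("200pps policer passed", passed_at_200, "of", len(arrivals))
print("600pps policer passed", passed_at_600, "of", len(arrivals))
```

Under these assumptions the 200pps policer drops more than half of the registers, while the 600pps policer lets the whole burst through, which is the effect we were after when scaling the PIM class upwards.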

The trouble with balancing CoPP came from the (S,G)s being received on two interfaces.  In multicast we can only have one interface as the incoming interface – this incoming interface is determined by the RPF (reverse path forwarding) check and is programmed into hardware.  Once programmed, all matching multicast on that interface is forwarded in hardware.  The (S,G)s being received on the non-RPF interface would not match a hardware (FIB) entry – and thus would be forwarded up to the CPU for software processing.  The problem with this is we’re punting useless traffic up to the CPU, wasting CPU resources and preventing that inband bandwidth from being used for other things (such as that fast multicast convergence).  When testing this initially, I identified IPMCMISS as the class which this useless, RPF-failing traffic was hitting in CoPP and trimmed it right back to 10pps.  When I went to do another convergence test, I found that convergence was super-slow, even though I had tuned the PIMREG class upwards.

What I didn’t realise was that IPMCMISS doesn’t [just] match RPF-failing traffic – it matches any multicast traffic that triggers a ‘FIB-miss’ – this was the ‘bloody hell’ moment.  Whenever we receive multicast traffic into hardware, and there’s no hardware-programmed FIB entry, it’s a FIB-miss – and this is punted to the CPU for processing or software switching.  A FIB-miss would be triggered the first time we see an (S,G), which is how we get into the process of punting to CPU, PIM learning, inserting into the MRIB, generating a PIM Register and programming the hardware. So by cutting away the bandwidth available to IPMCMISS, I was also reducing the chance of new (S,G) frames making it to the CPU for learning.
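The distinction between the two kinds of punt can be sketched in a few lines of Python. This is a simplified model of the behaviour described above, not the actual forwarding pipeline. The table layout, the addresses, the interface names and the reason strings are all hypothetical.

```python
# Hypothetical hardware multicast table:
# (source, group) -> the programmed incoming (RPF) interface.
hw_table = {
    ("192.0.2.10", "239.1.1.1"): "Eth1/1",
}


def classify(table, source, group, in_interface):
    """Decide what happens to a multicast packet arriving on in_interface."""
    rpf_interface = table.get((source, group))
    if rpf_interface is None:
        # First packet of a new (S,G): must reach the CPU so PIM can learn it.
        return "fib-miss"
    if rpf_interface != in_interface:
        # Known (S,G) arriving on the wrong interface: punted, nothing to learn.
        return "rpf-fail"
    return "hw-forward"


def copp_class(reason, separate_rpf_class):
    """Map a punt reason to the CoPP class that polices it (simplified)."""
    if reason == "hw-forward":
        return None  # never reaches the CPU, so no CoPP class applies
    if reason == "rpf-fail" and separate_rpf_class:
        return "mc-rpf-fail"
    return "IPMCMISS"


known_good = classify(hw_table, "192.0.2.10", "239.1.1.1", "Eth1/1")
known_wrong_if = classify(hw_table, "192.0.2.10", "239.1.1.1", "Eth1/2")
brand_new = classify(hw_table, "192.0.2.20", "239.1.1.2", "Eth1/1")

print(known_good, known_wrong_if, brand_new)
print(copp_class(known_wrong_if, separate_rpf_class=False),
      copp_class(brand_new, separate_rpf_class=False))
```

With a single miss class, the duplicate stream on the non-RPF interface and the very first packet of a new (S,G) both end up in IPMCMISS, so a policer tuned for one is also applied to the other.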

So to summarise – I now have to work out how to balance policing the useless traffic (and the CPU bandwidth it wastes) against the need to learn new (S,G)s.  I would never condone opening CoPP up for something like ARP; it sounds silly to me.

The end.

PS – [Just an updated thought] – It’s worth noting that on the N7000 you can define different class-maps for IPMCMISS and RPF-failing traffic. You can’t yet do this on the N3000, and I have yet to check the N5x00.

class-map type control-plane match-any mc-rpf-fail
  match exception multicast rpf-failure
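To see why a separate class helps, here is a rough Python sketch comparing one shared policer with two separate ones over a single second of punted traffic. The traffic mix (50 RPF-failing packets for every new-(S,G) packet) and the 100pps figure for the miss class are made-up numbers for illustration. The per-second packet budget is a simplification of a real policer.

```python
def packets_reaching_cpu(packets, class_of, limits_pps):
    """Apply a per-class packet budget over one one-second window.

    packets:    ordered list of packet kinds as they arrive.
    class_of:   packet kind -> CoPP class name.
    limits_pps: CoPP class name -> packets allowed in this second.
    """
    remaining = dict(limits_pps)
    accepted = []
    for kind in packets:
        cls = class_of[kind]
        if remaining[cls] > 0:
            remaining[cls] -= 1
            accepted.append(kind)
    return accepted


# Assumed mix: 1000 RPF-failing packets with 20 new-(S,G) packets interleaved.
packets = (["rpf-fail"] * 50 + ["new-sg"]) * 20

# One class for everything, trimmed back to 10pps.
shared = packets_reaching_cpu(
    packets,
    class_of={"rpf-fail": "IPMCMISS", "new-sg": "IPMCMISS"},
    limits_pps={"IPMCMISS": 10},
)

# RPF failures get their own tight class; misses keep a looser one.
split = packets_reaching_cpu(
    packets,
    class_of={"rpf-fail": "mc-rpf-fail", "new-sg": "IPMCMISS"},
    limits_pps={"mc-rpf-fail": 10, "IPMCMISS": 100},
)

print("shared policer, new (S,G) packets learned:", shared.count("new-sg"))
print("split policers, new (S,G) packets learned:", split.count("new-sg"))
```

In this toy run the shared policer spends its entire budget on RPF failures before the first new (S,G) shows up, whereas the split policers cap the useless traffic at the same 10pps and still let every new (S,G) through to the CPU.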


VM-FEX and VXLAN

So yesterday I had a chance to read up on both VXLAN (Virtual eXtensible LAN) and VM-FEX. I also had a good discussion about VXLAN with Greg Ferro (@etherealmind), who introduced me to the concept of OpenFlow.

My source for VM-FEX was a whitepaper by Cisco, ‘Unify Virtual and Physical Networking with Cisco Virtual Interface Card’, which made things pretty easy to understand. The short story is: we attach vNICs to virtual machines using VMware’s DirectPath. The VM sees a NIC as normal, vCenter sees a vNIC, and in reality it’s a hardware-based NIC emulation on the VIC. Instead of having a virtual switch on the host we do PTS (Pass Thru Switching), and the vNIC is bound to a VIF (Virtual InterFace) further up the path on a real physical switch. That VIF is presented just like a normal switchport from a configuration point of view, i.e. it looks the same as a switchport on the end of a FEX (Fabric Extender, aka 2232, 2148 etc). This VM-FEX vNIC supports vMotion in vSphere 5 by doing some fancy stuff around the NIC registers and state information that’s on the VIC. Now, this VM-FEX technology currently only works inside UCS – so we’ll have to wait to see how/if it can be implemented outside of that.

Now to VXLAN.. When I read Coding Relic’s write-up on how VXLAN works (there are three pages, but they’re all good), I couldn’t help thinking “This is OTV”. In fact, even after a quick discussion with Greg about the matter – I still think it’s OTV. The only difference is that we’re not terminating Layer-2 on a switch somewhere; we’re terminating it directly on the host machine. So now, these VXLANs only exist on the hosts – they don’t exist on the underlying infrastructure – which got me thinking about how this scales.

In a normal vSwitch/dvSwitch/1000v environment, the virtual-switch on the host only needs to learn the MAC addresses of the directly connected VMs – everything else is northbound on the physical infrastructure, so there’s only one way to send it (ignoring all the stuff about mac-pinning and port-channeling, blah). Now, we have VXLANs and only the hosts know what’s on each VXLAN – so essentially, the host now needs to have a bunch of MAC lookup tables (much like TCAM in a physical switch). Using similar control-plane methods to OTV, it learns the MAC addresses of VMs on other hosts via multicast and then stores that information locally.

The whole point of VXLAN is to break out of the 4096-VLAN limit and allow easy multi-tenancy – but how much overhead does learning all the MACs of all the VMs on all these new VXLANs add to the host itself? Of course, the obvious bad points around VXLAN are the lack of visibility of traffic on the underlying infrastructure, the fact that policy enforcement has to take place on the virtual-switch, and the added layer of troubleshooting.
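A quick back-of-the-envelope sketch in Python puts numbers on both halves of that question. The ID widths are the 12-bit 802.1Q VLAN ID and the 24-bit VXLAN Network Identifier. The per-segment VM counts and the segment numbers are entirely hypothetical, and the function only gives a worst-case bound that assumes a host learns every MAC in each segment it has a VM on.

```python
VLAN_ID_BITS = 12    # 802.1Q VLAN ID field
VXLAN_VNI_BITS = 24  # VXLAN Network Identifier field

vlan_ids = 2 ** VLAN_ID_BITS
vxlan_ids = 2 ** VXLAN_VNI_BITS

print("VLAN IDs: ", vlan_ids)
print("VXLAN IDs:", vxlan_ids)


def mac_entries_upper_bound(local_segments, vms_per_segment):
    """Worst-case MAC entries on one host.

    Assumes the host ends up learning every VM MAC in every segment
    where it has at least one local VM, and nothing from other segments.
    """
    return sum(vms_per_segment[segment] for segment in local_segments)


# Hypothetical deployment: segment ID -> total VMs across all hosts.
vms_per_segment = {5000: 200, 5001: 1500, 5002: 40}

# This host only has VMs on segments 5000 and 5001.
worst_case = mac_entries_upper_bound([5000, 5001], vms_per_segment)
print("worst-case MAC entries on this host:", worst_case)
```

The ID space grows from 4096 to 16,777,216 segments, but the table size on any one host is driven by how many segments it participates in and how big those segments are, not by the total number of VXLANs in the data centre.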

OpenFlow is the next topic on my reading list.. I’ve had a quick introduction from Greg but I need to do the reading too.


The New Job

So – time for that promised update on my new role in Cisco..

Datacenter Solution Engineer – I won’t give you the complete spiel that’s in the job-spec, but suffice to say this brings together my background in DC testing and my interests in virtualisation. The headline reads:

TS Data-Centre Solution Engineer – will be responsible for orchestrating the end-to-end solution support of Cisco’s Data Centre Solutions ( e.g. UCS & Nexus Architectures, Vblock/FlexPod, VDI/VXI, Cloud solutions – Private and Public, etc).

So basically – I help to deliver DC solutions which encompass the whole DC architecture: the R&S side of things (Nexus), end-hosts (UCS), storage networking (MDS), and the virtualisation part (VMware, Hyper-V, 1000v). There’s a flip side to this as well; I help bring the partners, vendors and customers up to speed on new technologies and features. On the side, there’ll also be some escalation support for the APAC TAC team.

That’s the short version of what I’ll be doing. There’s still a long way to go before I feel confident in the role though. DC switching and routing technologies are fairly well known, but anything that involves virtualisation appears to be moving at the speed of light in terms of development. On my list of things I already have to catch up on: VM-FEX, VXLAN, OpenFlow and Hadoop clusters. I’m putting Scott Lowe’s Mastering VMware vSphere 5 on my reading list for this month too.

Next update – some good virtualisation bloggers and tweeters.
