It's always a networking problem...unless it's not...

I was recently challenged by a software vendor to prove an issue with their application wasn't the load balancer behind which the application server was running.

Being the person who had provisioned the VMs, the proxy, the load balancer, the NFS, the routing, the NATing, and the firewall this particular application was using, I had a pretty good idea how traffic was flowing. What stymied me on the call however was how I could objectively prove a negative in a way that made sense. There was a LOT of complexity to this environment and many ways things could be going wrong. Maybe it WAS the networking, but 'maybe' isn't good enough when millions of dollars are on a tight deadline. I needed something better than 'maybe', so I went to the source of truth on every network...the packets.

TCP connections rely on what is known as a three-way handshake. The client makes a request to a service by sending a special packet called a SYN packet. Assuming the SYN packet makes it to the server hosting the service, the service responds with a SYN-ACK packet. When the SYN-ACK is received, the client assumes all is clear for the connection to proceed and replies with an ACK packet, establishing the session between the two devices. This session would look something like this:

src (client)                dst (server)
10.0.0.4:40887              6.43.3.27:443

In the picture above, 6.43.3.27 needs to have a service listening on socket 6.43.3.27:443, which is a combination of the server IP address and the service port number. Assuming an open network, the client (10.0.0.4) should then be able to make a connection (SYN) request. If all goes well, a session is created with the server, with the client side of the connection using a unique ephemeral port (in this case, 40887). This forms an ephemeral socket connection on the client which is only active for the duration of the session. Clients and application services make millions of these kinds of connections over networks, from your little home Wi-Fi network to the big, bad Internet. It's a standard way for computers to communicate today, but can get a bit complicated when the packets have to jump through many devices before reaching their destination. For this reason, networking - the infrastructure that handles packet flow - is often blamed when things don't work. Sometimes the network IS to blame, however just as often...if not more often in my experience...it's something with the application.
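Jumping ahead to a tool I'll cover below, this is roughly what that handshake looks like in a tcpdump capture. The addresses are the made-up ones from the diagram above, the interface name will vary on your system, and I've trimmed timestamps and sequence details for readability, so treat this as a sketch rather than real output:

tcpdump -i eth0 -nn host 6.43.3.27 and port 443

IP 10.0.0.4.40887 > 6.43.3.27.443: Flags [S], ...     <- SYN from the client
IP 6.43.3.27.443 > 10.0.0.4.40887: Flags [S.], ...    <- SYN-ACK from the server
IP 10.0.0.4.40887 > 6.43.3.27.443: Flags [.], ...     <- ACK from the client, session established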

Now, with that little review of TCP networking out of the way, let's move on to some of the challenges I've faced in the real world. Every developer, sales rep, and vendor will tell you their app just works. In a traditional stack, it's as simple as setting up the database server, installing the application, then logging in. Easy, right? But it never is. If I'm dropping a million dollars on a product, it needs to have some fault tolerance, which means load balancing and redundancy. It also needs some security monitoring and filters, which means endpoint protection and firewalls. What about compliance? We need that too, or we lose millions more to breached contracts and regulation fines. Data protection? Now we're into encryption, SSO, Zero Trust, MFA, and every other security industry buzzword out there. Before long the application isn't working at all. The database can't understand the encryption protocol you're using. The TLS cert being used isn't from a trusted CA. The application server itself is being crushed by all the additional services you've had to install. The compliance team is saying you need to disable a port that is required by the application. And on top of all of this, the additional worker nodes you had to install to help the dying application server aren't communicating correctly with the app server. The vendor shrugs and says, "It's probably the network," but neither of us knows that for sure.

Enter TCPDump. TCPDump is one of many tools used to check out the packets flowing to and from your device. The trick with TCPDump is that it's largely passive, meaning it watches the traffic without altering anything. If I'm a developer with a deep understanding of my application, this can be very helpful to debug issues. But maybe I'm not a developer. Maybe I'm just an engineer who knows a little about servers and how they talk to each other. A packet capture of an unknown application won't mean a whole lot. What would be REALLY nice is to have my own little session which I can fully control. Of course in such a situation I would reach for the venerable Netcat program. With Netcat I can make a client connection to an application and watch the transaction with TCPDump to see how it goes, all without needing to know anything about the application itself other than the IP that's hosting it and the service port. What follows is exactly how I managed that during a real troubleshooting session.

Troubleshoot a simple connection

To perform a full inspection of a session, I need to have access to both the application server (the thing running the service) and a client accessing said service. TCPDump will be running on the application server so I can observe what the traffic looks like coming from the client. Netcat will be used on the client to fire off customized requests. I could also use the application itself if I was familiar enough with it, but that can get noisy and error prone depending on how much it's trying to communicate. Netcat offers a much more precise tool for testing, assuming you can install it without running afoul of policy.

Here is what a simple client/server connection would look like from both sides.

Client View (sending communication):

nc 10.0.0.8 80


$ warydev: nc 10.0.0.8 80
amazinggrace 
HTTP/1.1 400 Bad Request
...

Server Side (receiving communication):

tcpdump -i enp0s3 -vvnnXX port 80 | grep amazinggrace


$ waryfirewall: tcpdump -i enp0s3 -vvnnXX port 80 | grep amazinggrace
tcpdump: listening on enp0s3, link-type EN10MB (Ethernet), capture size 262144 bytes
	0x0040:  678e 616d 617a 696e 6767 7261 6365 0a    g.amazinggrace.

As you can see, when the client (nc) sent the string 'amazinggrace', we could see it come through on the server at the other end using tcpdump, despite the fact that the service had no idea what 'amazinggrace' means. Cool, right?
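For reference, the flags in that tcpdump command are doing most of the heavy lifting: -i picks the capture interface, -nn turns off name and port resolution, -vv increases verbosity, -XX prints each packet in hex and ASCII (which is what lets grep spot the string in the payload), and 'port 80' limits the capture to the traffic we care about. If the hex columns are hard to read, tcpdump's -A option prints payloads as ASCII only and should work just as well for this kind of string hunting:

tcpdump -i enp0s3 -nn -A port 80 | grep amazinggrace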

Of course this ALL assumes we're using insecure ports and unencrypted traffic. Had this been running on a secure port like 443, such a simple little test wouldn't work...

Or would it?

Troubleshooting Encrypted channels...

The funny thing about networking and MOST firewalls is that they don't care about the security of bits traveling over the wire. Encryption happens at a much higher layer in the stack, Layer 6 to be exact if you're following the OSI Model. When it comes to most 'networking' issues, those are down in L1 (wire), L2 (switch), and L3 (router). Firewalls usually live at L3/L4 if we're talking traditional IP/port ACL rules. If you have a rule allowing Server1 to talk with Server2 over TCP/80, the firewall couldn't give two bits about what that communication looked like from a purely networking perspective, only that there was an Access Control List/Rule that allows the session to form between the two IPs. Sorry to break it to you, but there is no such thing as a 'secure port'. One can run any protocol over any port they desire as long as that port isn't already in use. The only real data security we get usually comes in around L6 where encryption ciphers can wrap up your plaintext communications. HTTPS running over 443 is just a standard, nothing more.
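To make that concrete, here's a quick sketch of the 'no such thing as a secure port' point using the same hosts from the earlier example. Assuming nothing else on the server is already bound to 443 (and remembering that binding a port below 1024 requires root), a plain Netcat listener on 443 will happily accept a completely unencrypted HTTP request:

$ appserver: nc -lv 443
Ncat: Listening on :::443
Ncat: Listening on 0.0.0.0:443

$ warydev: curl http://10.0.0.8:443/amazinggrace

curl will hang waiting for a response the listener never sends, but the raw GET request shows up on the server console in clear text. The port number did nothing to protect it; only the protocol layered on top of the port can do that.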

So let's apply this to our scenario. The vendor is telling you the app isn't working right because of the network. What do you do? KISS it.

Keep
It
Simple,
Stupid.

Turn off their app running on whatever 'secured' port they've chosen that supposedly isn't working and replace it with a simple Netcat listener tuned to that port. Let's use TCP/8443 as an example, just to be different. Here's how it would look:


$ appserver: nc -lv 8443
Ncat: Listening on :::8443
Ncat: Listening on 0.0.0.0:8443

Now reconnect from the client to this port like in my previous example.


$ warydev: nc 10.0.0.8 8443
amazinggrace

If the networking is good, you should see the connection made on the other end.


$ appserver: nc -lv 8443
Ncat: Listening on :::8443
Ncat: Listening on 0.0.0.0:8443

Ncat: Connection from 10.0.0.86.
Ncat: Connection from 10.0.0.86:37988.
amazinggrace

See that nice socket connection? And you can't argue the data is getting mangled because it's right there in the console. There isn't any need to recreate the encryption channel for this level of troubleshooting because all that is doing is wrapping the data, again, at Layer 6. The networking, in this case, is good. If the application isn't receiving data, there is something else going on.
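If you want a second witness alongside the Netcat listener, run tcpdump on the application server at the same time, filtered to the test port, just like the earlier port 80 example (the interface name will vary on your system):

$ appserver: tcpdump -i enp0s3 -nnvvXX port 8443 | grep amazinggrace

If the test string shows up in the capture but never makes it into the real application once you swap it back onto the port, the problem is sitting above the network, on the application's side of the socket.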

💡
Note: I know some of you out there have worked in environments that deployed deep packet inspection technology and you're dying to tell me I'm wrong because of this fringe case. "The NextGen firewall dug into my packets, said they were bad, and blocked the connection!" My answer to this is a question: Is that REALLY a networking issue? When a device terminates a TLS session at the perimeter to look for bad stuff coming or going, then ends up killing the connection instead of forwarding it because the algorithms an analyst configured incorrectly identified a packet as malicious, it's a policy issue, not the network. You are welcome to disagree with my assessment of this fringe case, but I felt it only fair to both of us that I addressed this potential caveat so we could both feel right.

Troubleshooting through proxies...

The last example I want to show is what this kind of test might look like through a proxy or load balancer. These systems are VERY common and DO change the networking between a client and the app because they usually catch the connection and replay it from themselves. This means the application will likely see the PROXY IP coming in instead of the client's (again, there are some tricks around this, but they don't usually work out of the box). 90% of the time this behavior is fine because the web/app server doesn't care where connections come from, but it can make troubleshooting a little difficult without the ability to perform very precise testing.

In my lab I have a Squid proxy set up to relay HTTP traffic over TCP/80. Here's how I would go about passing web traffic through that proxy in a way I can see it on the other end, without the Squid proxy chucking me out:

First, I set up my tcpdump as before on the web/app server. I know the proxy is in the middle and will actually be what I'm pointing at, however the traffic is supposed to redirect to my web/app server, so that's where I'll be listening.

[appserver ~]# tcpdump -i enp0s8 -nnvvXX

Now I make a connection attempt from my client using curl, tacking on our test string 'amazinggrace'. I should see something like this in my packet capture:


#CLIENT

$ curl http://10.0.0.155/amazinggrace


404 Not Found


# WEB/APP SERVER

$ tcpdump -i enp0s8 -nnvvXX
dropped privs to tcpdump
tcpdump: listening on enp0s8, link-type EN10MB (Ethernet), capture size 262144 bytes
...
    10.0.0.155.52466 > 10.0.0.156.80: Flags [P.], cksum 0xf273 (correct), seq 1:520, ack 1, win 229, options [nop,nop,TS val 765301136 ecr 3503452604], length 519: HTTP, length: 519
	GET /amazinggrace HTTP/1.1
...

In this scenario, the 10.0.0.155 IP is actually the proxy, not the client. But you can see it's passing the client connection data (again, /amazinggrace) and it's successfully making it through to the web/app server (10.0.0.156). Whether that /amazinggrace path actually exists on the server is irrelevant (hence the 404 above), just like whether the traffic is encrypted or not. What's important about this troubleshooting is that it proves the networking and traditional firewall ACLs are passing traffic as expected. Any issues going on beyond this will lie elsewhere, be it server configurations or the application itself.
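One note on the client side: my lab redirects traffic through the proxy, so curl just points at the proxy's IP. If you're testing against an explicit forward proxy instead, curl's -x flag lets you aim at the proxy while still requesting the real backend URL. The sketch below assumes Squid is listening on its default port of 3128, so adjust for your environment:

$ curl -x http://10.0.0.155:3128 http://10.0.0.156/amazinggrace

Either way, the string you tack onto the URL is what you go hunting for in the capture on the far side.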

Summary:

Technology is a tough gig in and of itself, but it's only made harder when conclusions are drawn without proper evidence. Hopefully this article provides some tips and tricks to use next time you're told 'it's the network'. Take a few minutes to actually troubleshoot the issue. Look through some change logs to see if something was moved in the environment (like, was the host server rebooted without the service being enabled...if I had a dollar for every time THAT has happened). Even if you don't have FULL access to all the legs of the journey, test what you can and ask your server/network team to validate the logs before assigning blame and subsequent tickets to already overloaded teams.
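On that rebooted-server point: if the host is systemd-based, two quick commands will tell you whether the service is actually running and whether it's set to start on boot. The unit name below is just a placeholder for whatever the vendor's service is actually called:

$ appserver: systemctl status vendor-app.service
$ appserver: systemctl is-enabled vendor-app.service

Thirty seconds of checking there can save everyone a lot of finger pointing.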

Lastly (and this is to my younger engineers out there), trust yourself. I don't mean be an ass. But if you understand how something works (because...you built it...), don't be afraid to stand your ground on how it's supposed to work. If it's not, why? Break it down in pieces and test each step. Do you get the expected results from Step 1, 2, and 3? What changes at Step 4? Ask your vendor and other teams to do the same in order to find the issue collaboratively to avoid dumping the problem (and your reputation) on someone else to sort out.

Cheers!

p.s. I'm listing one of the best TCPDump guides I've ever found below. Thank you, Daniel Miessler!

https://danielmiessler.com/study/tcpdump/