January 2nd, 2014 by Justin
Our average traffic at Chartbeat has grown about 33% over the last year, and depending on news events, we can see our traffic jump 33% or more in a single day.  Recently we’ve begun investigating ways we can improve performance for handling this traffic through our systems.  We set out and collected additional metrics from our systems, and we were able to reduce TCP retry timeouts, reduce CPU usage across our front end machines by about 20%, and improve our average response time from 175ms to 30ms.

History

First, a brief overview of our architecture.  Currently we are hosted on Amazon Web Services.  Every 15 seconds, we receive data from our JavaScript that is embedded on clients’ pages.  This data is sent to ping.chartbeat.net.  Behind ping.chartbeat.net sit a number of m1.large servers running Nginx that handle the proxying of these “pings” to our realtime infrastructure.  For this article I’ll be focusing on how we improved the performance of receiving and proxying the data sent to ping.chartbeat.net.
In 2009, when the company was first starting out, we used round robin DNS to load balance the traffic for ping.chartbeat.net.  While this scaled fine for the first year, DNS can only serve up a maximum of 13 records in a response due to the 512 byte limit on DNS messages sent over UDP.  This obviously wasn’t going to be a long term solution for us if we wanted to grow the company.  In 2010 we switched our DNS provider to the folks at Dyn.  Dyn offers a load balancing service that monitors the IPs to ensure they are reachable and automatically pulls unreachable ones from the DNS response.  Each server was assigned an elastic IP from Amazon so that we could handle a failure by just moving the IP to a new server rather than having to update the service with a new IP.  We switched ping.chartbeat.net over to Dyn’s DNS load balancing service and have used it over the last 3 years.

Cons of the setup

While Dyn’s service has been great over the last 3 years, they unfortunately have no control over the problems that exist within DNS itself.  When we experienced a server failure on our front end servers, Dyn would pull the IP address from being served up in ping.chartbeat.net.  With a low enough TTL on the record, the IP would stop being served up from DNS caches.  Unfortunately there are a large number of (misbehaving) DNS servers out there that don’t properly obey TTLs on records and will still serve up stale records for an indefinite amount of time.  This meant that we were not getting data from clients who were still being served the stale record while that server was being replaced.  This also presented problems when we scaled our infrastructure up to handle large events.  When we went to scale our infrastructure back down, clients were still being served the old IPs of the servers we had removed from rotation.  We would see traffic flowing in days after we removed records whose TTLs were set to only a few minutes.  Eventually we had to have a cutoff point so we could shut down the servers, but this meant dropped pings from clients, which we want to avoid as much as we can.
DNS requests being distributed evenly does not mean we will see traffic get evenly distributed across our servers.  Many users are behind proxies that would send thousands of users to one server.  We regularly saw variances of up to 1000 req/sec across our front end servers.  Not being able to efficiently balance the traffic meant we were running with more servers than necessary.

Why didn’t you just use an ELB?

At the time, ELB was still a very young service from Amazon.  ELB was launched without support for SSL termination, a feature we needed.  SSL support was launched in 2010 but without the ability to do full end-to-end SSL encryption.  Full end-to-end encryption was launched later, in the summer of 2011.  By the time the features we needed were fully supported, AWS had experienced some major outages involving the ELB service, which made us hesitant to move towards using it right away.

Improving Performance and Reliability

The first step to improving our performance was to graph and measure the response times on our front end servers.  We’re utilizing Etsy’s Logster to parse the access logs every minute and output the data to statsd which then makes its way into graphite.  We run logster under ionice and nice in order to minimize the impact the log parsing has on the performance of the server each minute.  We also write our logs to a separate ephemeral partition formatted with ext4.  Using ext4 instead of ext3, we saw better performance when rotating out log files.
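As a rough illustration of the log-to-statsd step, the sketch below turns one access log line into a statsd timing sample.  It is not Logster’s API and not our actual parser.  The log format (request time in seconds as the last field), the bucket name, and the helper names are all assumptions made for the example; only the `bucket:value|ms` line format and UDP port 8125 are statsd conventions.

```python
import socket


def request_time_ms(log_line: str) -> float:
    """Extract the request time in milliseconds.

    Assumes (hypothetically) that the access log ends with the
    request time in seconds, e.g. '... 200 43 0.175'.
    """
    return float(log_line.rsplit(None, 1)[-1]) * 1000.0


def statsd_timing(bucket: str, millis: float) -> bytes:
    """Format a statsd timing sample: '<bucket>:<value>|ms'."""
    return f"{bucket}:{millis:g}|ms".encode("ascii")


def send(sample: bytes, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget UDP send to a statsd daemon (8125 is the usual port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(sample, (host, port))


if __name__ == "__main__":
    line = '1.2.3.4 - - [02/Jan/2014:00:00:00 +0000] "GET /ping HTTP/1.1" 200 43 0.175'
    # 'frontend.request_time' is a made-up bucket name.
    print(statsd_timing("frontend.request_time", request_time_ms(line)))
```

A real parser would aggregate a minute of samples before emitting, which is closer to how a once-a-minute log run behaves.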
In addition to graphite, we are using Ganglia to graph system level metrics.  We are graphing numerous metrics from the output of netstat -s, such as the number of TCP timeouts and socket listen queue overflows.  We also graph the counts for each state a socket can be in.  Due to the number of sockets we have on each server, it’s not possible to use netstat to view the states of all the sockets without it affecting performance and taking 5 minutes or longer to return.  Thanks to the folks in #ganglia on Freenode, we learned about using the command ss -s as an alternate way to get socket information that returns in a fraction of the time.
Some initial numbers from a front end server:
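The counters that netstat -s reports are also exposed on Linux in /proc/net/netstat, as pairs of lines: a row of names followed by a row of values.  The sketch below is one way to pull out the two listen queue counters for graphing.  It is an illustration rather than the Ganglia plugin we run, and the function names are made up for the example.

```python
def parse_proc_netstat(text: str) -> dict:
    """Parse /proc/net/netstat-style text into {'TcpExt.Name': value}."""
    rows = [ln.split() for ln in text.splitlines() if ln.strip()]
    counters = {}
    # Rows come in pairs: "TcpExt: Name1 Name2 ..." then "TcpExt: 1 2 ...".
    for names, values in zip(rows[0::2], rows[1::2]):
        prefix = names[0].rstrip(":")
        for name, value in zip(names[1:], values[1:]):
            counters[f"{prefix}.{name}"] = int(value)
    return counters


def read_listen_counters(path: str = "/proc/net/netstat") -> dict:
    """Return the listen queue overflow and drop counters (Linux only)."""
    with open(path) as fh:
        counters = parse_proc_netstat(fh.read())
    wanted = ("TcpExt.ListenOverflows", "TcpExt.ListenDrops")
    return {key: counters.get(key) for key in wanted}
```

Both values are cumulative since boot, so a graph should plot the difference between successive readings rather than the raw number.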
  • 80k packets/sec at peak
  • 200-250ms request times
  • 2k-4k req/sec
  • 80% avg CPU.  Unfortunately we are on an old version of the Ganglia CPU plugin and initially didn’t realize we were not graphing steal % time, so we were not getting a true picture of our CPU utilization and had believed this number was lower.
  • 350k connections in TIME_WAIT (yikes)
  • 14k interrupts/sec

Graph all the things! (then actually look at the graphs)

Now that we had all these metrics being graphed, it was important to actually understand how these metrics were affecting our reliability and performance.  Two graphs that stood out right away were our graphs of “SYNs to LISTEN sockets dropped” and “times the listen queue of a socket overflowed”.
[Graph: syns_to_socket_listen_drops]
Any metric that contains the words “dropped” or “overflow” with values that were increasing couldn’t be a good thing.
What do these values actually mean though?  When calling the listen function in BSD and POSIX sockets, an argument is accepted called the backlog.  From the listen man page:
The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later re-attempt at connection succeeds.
and a very important note:
If the backlog argument is greater than the value in /proc/sys/net/core/somaxconn, then it is silently truncated to that value; the default value in this file is 128.  In kernels before 2.4.25, this limit was a hard coded value, SOMAXCONN, with the value 128.
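To make the truncation concrete, here is a small sketch of what an application asks for versus what it gets.  The helper names are invented for the example, and the fallback of 128 is simply the default quoted in the man page.

```python
import socket


def read_somaxconn(path: str = "/proc/sys/net/core/somaxconn", default: int = 128) -> int:
    """Read the kernel's listen backlog cap, falling back to the documented default."""
    try:
        with open(path) as fh:
            return int(fh.read().strip())
    except (OSError, ValueError):
        return default


def effective_backlog(requested: int, somaxconn: int) -> int:
    """The backlog actually used: the request, silently capped at somaxconn."""
    return min(requested, somaxconn)


def listening_socket(port: int, requested_backlog: int) -> socket.socket:
    """Open a listening TCP socket on localhost with the requested backlog."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))
    sock.listen(requested_backlog)  # no error is raised if this exceeds somaxconn
    return sock


if __name__ == "__main__":
    cap = read_somaxconn()
    print("requested 16384, effective", effective_backlog(16384, cap))
```

The cap applies in one direction only: an application that asks for a small backlog keeps that small backlog no matter how high somaxconn is set.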
We were dropping packets since the backlog queue was filling up.  Worse, clients will wait 3 seconds before re-sending the SYN, and if that SYN doesn’t get through either, the next attempt doesn’t come until 9 seconds after the original.
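Those delays follow from the retransmission timeout doubling after each failed attempt.  The sketch below works through the arithmetic, assuming the 3 second initial timeout mentioned above; the actual initial timeout depends on the client’s TCP stack, and the function name is made up for the example.

```python
def syn_retransmit_times(initial_rto: float = 3.0, retries: int = 2) -> list:
    """Seconds after the first SYN at which each retransmission is sent.

    Assumes the timeout starts at initial_rto and doubles after every
    unanswered attempt (standard exponential backoff).
    """
    times = []
    elapsed = 0.0
    rto = initial_rto
    for _ in range(retries):
        elapsed += rto       # wait out the current timeout
        times.append(elapsed)
        rto *= 2             # back off before the next attempt
    return times


if __name__ == "__main__":
    print(syn_retransmit_times())  # [3.0, 9.0]
```

A client whose first two SYNs are dropped therefore waits at least 9 seconds before its connection can even begin.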
Another symptom we saw when looking at /var/log/messages was this message showing up
[84440.731929] possible SYN flooding on port 80. Sending cookies.
Were we being SYN flooded?  Not an unreasonable thing to expect with the servers exposed to the internet, but it turns out this message can send you looking in the wrong direction.  Couldn’t we just turn off SYN cookies to make this message stop, cover our eyes, and pretend it’s not happening?  Digging into this message further, we learned that SYN cookies are only triggered when the SYN backlog overflows.
Why were we overflowing the SYN backlog?  We had set net.core.somaxconn and net.core.netdev_max_backlog both to 16384, and the number of sockets in the SYN_RECV state was nowhere near those numbers.  Looking at some packet captures, it did not appear we were actually under a SYN flood attack of any kind.  It seemed like our settings for max SYN backlog were being ignored, but why?  While researching this, we came across some folks with a similar issue and discovered that Nginx uses a default of 511 for the backlog value if one is not set; it does not default to the value of SOMAXCONN.  We raised the backlog value in the listen statement (e.g. listen 80 backlog=16384) to match our sysctl settings and the SYN cookie messages disappeared from /var/log/messages.  The number of TCP listen queue overflows went to 0 in our Ganglia graphs and we were no longer dropping packets.
[Graph: listen_queue_overflows]
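A mismatch like this is easy to miss, so a simple check can help.  The sketch below scans Nginx configuration text for listen directives and reports any whose backlog is smaller than somaxconn.  It is a deliberately naive regex scan and not a real Nginx config parser; the function names are hypothetical, and 511 is the default described above.

```python
import re

NGINX_DEFAULT_BACKLOG = 511  # used when a listen directive has no backlog=

# Matches "listen <params>;" at the start of a line (comments are skipped).
LISTEN_RE = re.compile(r"^\s*listen\s+([^;]+);", re.MULTILINE)


def listen_backlogs(conf_text: str) -> dict:
    """Map each listen address/port to its configured or default backlog."""
    backlogs = {}
    for match in LISTEN_RE.finditer(conf_text):
        params = match.group(1).split()
        backlog = NGINX_DEFAULT_BACKLOG
        for param in params[1:]:
            if param.startswith("backlog="):
                backlog = int(param.split("=", 1)[1])
        backlogs[params[0]] = backlog
    return backlogs


def undersized(conf_text: str, somaxconn: int) -> dict:
    """Return the listen entries whose backlog is below somaxconn."""
    return {addr: b for addr, b in listen_backlogs(conf_text).items() if b < somaxconn}


if __name__ == "__main__":
    conf = "server {\n    listen 80;\n    listen 443 ssl backlog=16384;\n}\n"
    print(undersized(conf, somaxconn=16384))  # {'80': 511}
```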

More sysctl TCP tuning

After fixing the backlog issue, it was time to review our existing sysctl settings.  We’ve had some tunings in place for a while, but it had been some time since they were reviewed to ensure they still made sense for us.  There’s a lot of bad information out on the web on tuning TCP settings under sysctl that people just blindly apply to their servers.  Oftentimes these resources don’t bother explaining why they are setting a certain sysctl parameter; they just give you a file to put in place and tell you it will give you the best performance.  You should be sure you fully understand any value you are changing under sysctl.  You can seriously hurt the performance of your server with the wrong values, or even by enabling certain options in the wrong environment.  The TCP man page and TCP/IP Illustrated, Volume 2: The Implementation were great resources in helping to understand these parameters.
Our current sysctl modifications as they stand today are as follows (comments included).  Disclaimer: please don’t just use these settings on your servers without understanding them first.
# Max receive buffer size (8 MB)
net.core.rmem_max=8388608
# Max send buffer size (8 MB)
net.core.wmem_max=8388608
# Default receive buffer size
net.core.rmem_default=65536
# Default send buffer size
net.core.wmem_default=65536
# The first value tells the kernel the minimum receive/send buffer for each TCP connection,
# and this buffer is always allocated to a TCP socket,
# even under high pressure on the system. …
# The second value specified tells the kernel the default receive/send buffer
# allocated for each TCP socket. This value overrides the /proc/sys/net/core/rmem_default
# value used by other protocols. … The third and last value specified
# in this variable specifies the maximum receive/send buffer that can be allocated for a TCP socket.
# Note: The kernel will auto tune these values between the min-max range
# If for some reason you wanted to change this behavior, disable net.ipv4.tcp_moderate_rcvbuf
net.ipv4.tcp_rmem=8192 873800 8388608
net.ipv4.tcp_wmem=4096 655360 8388608
# Units are pages (default page size is 4 KB)
# These are global variables affecting total pages for TCP
# sockets
# 8388608 pages * 4 KB = 32 GB
#  The three values are: low pressure high
#  When mem allocated by TCP exceeds “pressure”, kernel will put pressure on TCP memory
#  We set all these values high to basically prevent any mem pressure from ever occurring
#  on our TCP sockets
net.ipv4.tcp_mem=8388608 8388608 8388608
# Increase max number of sockets allowed in TIME_WAIT
net.ipv4.tcp_max_tw_buckets=6000000
# Increase max half-open connections.
net.ipv4.tcp_max_syn_backlog=65536
# Increase max TCP orphans
# These are sockets which have been closed and no longer have a file handle attached to them
net.ipv4.tcp_max_orphans=262144
# Max listen queue backlog
# make sure to increase nginx backlog as well if changed
net.core.somaxconn = 16384
# Max number of packets that can be queued on interface input
# If kernel is receiving packets faster than can be processed
# this queue increases
net.core.netdev_max_backlog = 16384
# Only retry creating TCP connections twice
# Minimize the time it takes for a connection attempt to fail
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
# Time out orphaned connections sitting in FIN_WAIT_2 after 7 seconds
net.ipv4.tcp_fin_timeout = 7
# Avoid falling back to slow start after a connection goes idle
# keeps our cwnd large with the keep alive connections
net.ipv4.tcp_slow_start_after_idle = 0
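Since several of these values are expressed in different units, it can help to sanity check them programmatically.  The sketch below parses settings in the format shown above and converts the tcp_mem page count into GiB.  The function names are made up for the example, and the 4096 byte page size is an assumption that should be confirmed on the target machine.

```python
def parse_sysctl(text: str) -> dict:
    """Parse sysctl.conf-style text into {key: [values]}.

    Handles comment lines and optional spaces around '='.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.split()
    return settings


def pages_to_gib(pages: int, page_size: int = 4096) -> float:
    """Convert a page count to GiB (page_size of 4096 bytes is assumed)."""
    return pages * page_size / 2**30


if __name__ == "__main__":
    conf = (
        "# Max listen queue backlog\n"
        "net.core.somaxconn = 16384\n"
        "net.ipv4.tcp_mem=8388608 8388608 8388608\n"
    )
    settings = parse_sysctl(conf)
    low, pressure, high = (int(v) for v in settings["net.ipv4.tcp_mem"])
    print(pages_to_gib(pressure))  # 32.0
```

The same parsed dictionary can be compared against the live values under /proc/sys to confirm that a change has actually been applied.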
A few additional settings we looked at were tcp_tw_reuse, tcp_tw_recycle and tcp_no_metrics_save.
After reading about tcp_tw_recycle, right away we decided we did not want to have it enabled.  You should really never enable this as it will affect any connections involving NAT and result in dropped connections.  See http://stackoverflow.com/questions/8893888/dropping-of-connections-with-tcp-tw-recycle for more information.  This is an example of a setting I often see in TCP performance blog posts where the option is set to enabled with no explanation as to why and the dangers involved in having it enabled.
tcp_tw_reuse is a safer way to reduce the number of sockets you have in a TIME_WAIT state.  The kernel will allow re-using a socket if it’s deemed safe from a protocol standpoint when a socket is in a TIME_WAIT state. In our testing we really didn’t see a discernible difference with this option enabled so we left it disabled.
By default, the kernel keeps metrics on each TCP connection established to it.  Values like rto (retransmission timeout), ssthresh (slow start threshold), and cwnd (congestion window size) are kept so that when the kernel sees a connection from that address again, it can re-use those values in hopes of optimizing the connection.  You can view these values with the command ip route show cache.  There are cases when you may not want this caching, and setting tcp_no_metrics_save turns it off.  A slow client coming from behind a NAT can end up causing non-optimal values to be cached and have them get re-used for the next person connecting from that IP.  If your clients are mostly on mobile where the connection is very erratic, you’ll want to consider enabling tcp_no_metrics_save, since most likely each session will have different optimal connection settings.
Some important takeaways from this first part are:
  • DNS is not a great means of load balancing traffic
  • Modifying sysctl values from their defaults can be important to ensure reliability
  • Graphing metrics is your friend
In part 2 of this blog post, we’ll explore our move to using Amazon’s ELB and our enabling of HTTP keep-alive, and look at some graphs showing the impact of the changes.