Fixing the "too many open files" error

If your service is dying from a "Too many open files" error, don't increase the nofile limit in /etc/security/limits.conf like most of the internet says. Instead, do what most of the internet does, and hard-code a ulimit -n 100000 line in your /etc/init.d/ script.
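Here's the whole trick up front, as a sketch ("myservice" and its path are placeholders, not a real package):

# /etc/init.d/myservice -- illustrative skeleton only
case "$1" in
  start)
    # raise the open-files limit in this shell; the daemon inherits it
    ulimit -n 100000
    start-stop-daemon --start --quiet --exec /usr/sbin/myservice
    ;;
esac

Putting the ulimit call right before the daemon launches means the child process just inherits the raised limit, and nothing else on the system has to cooperate. The rest is the story of how long it took to figure out that this is all there is to it.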

Discovery process

I was chasing a "too many open files" problem on a Kafka deployment, and looked to Cassandra for the fix, since I knew it ran with an increased limit.

According to most of the internet, the right way to increase the limit on open files is to edit /etc/security/limits.conf (or drop a file in /etc/security/limits.d/, which does the same thing). Do that, log out and back in, and ulimit -n will show you your new limit. Start a process, cat /proc/[pid]/limits, and you'll see the new limits took effect. Done, right?
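For the record, that looks something like this (the username and the numbers are whatever your setup needs):

# /etc/security/limits.conf (or a file in /etc/security/limits.d/)
cassandra  soft  nofile  100000
cassandra  hard  nofile  100000

# verify against a running process (replace 1234 with a real pid)
grep 'Max open files' /proc/1234/limits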

Nope. Set the limits for the cassandra user (which has no shell) just like they recommend, and good luck verifying that they took effect. sudo su cassandra -s /bin/bash -c "ulimit -n" shows your own limits, not the new ones. But run ulimit -n from cron.d as the cassandra user, and it does show the correct limits.
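Concretely, the two checks looked roughly like this (the cron.d filename and log path are made up):

# reports the limits of your current session, not the ones from limits.conf
sudo su cassandra -s /bin/bash -c "ulimit -n"

# /etc/cron.d/check-limits -- this one does report the new limit
* * * * *  cassandra  ulimit -n >> /tmp/cassandra-ulimit.log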

But the Cassandra process does run with the correct limits, so maybe that's because it uses start-stop-daemon instead of su to launch the process? Nope: start-stop-daemon doesn't read the limits file at all, since it never goes through PAM, and pam_limits is what actually applies limits.conf. Debian has an open bug with a patch to fix it. Given that it's been open since 2005, I'm sure it'll be merged in soon.
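A typical init-script start line looks something like this (names and paths are placeholders), and there's simply no step in it where limits.conf would get consulted:

# plain fork/exec as the target user; no PAM session, no pam_limits
start-stop-daemon --start --quiet --chuid cassandra --exec /usr/sbin/cassandra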

So how does Cassandra start with the right file limit?

Oh, they just hard-code the limit in the init script:

FD_LIMIT=100000
ulimit -n 100000

Cheaters. If only they'd admitted this in their docs, I wouldn't have gone down the wrong path.