Problems in production
I have been a tad busy fixing some weird little bugs lately. They helped me appreciate the multitude of things that can go wrong in a live environment and served as a gentle reminder that you should always be on your toes.
Here they are:
1. LDAP and the user
A web application product was configured to authenticate against an LDAP directory. The directory was segregated into roles / groups / OUs, the usual. One of the users had trouble logging into the product. This was weird because this person was a valid user and Outlook, which uses the same LDAP tree, seemed to recognize him. So I dug into it. The easiest way to check what is going on is to use an LDAP directory browser. I use the one provided by Novell for free. Its LDAPS support is not great but it will do for basic lookups.
As I dug into the tree, it turned out this user was mapped as a group. Yes, a group. The product was configured so that only the “user accounts” branch of the directory was searched for valid users. Since this user was, ummm, a group, the application was unable to find him. My only thought was ‘Wow! How did this go under the radar for so long?’ The mistake was understandable though, since groups and users sit under similar-looking structures. The LDAP admin must have had too many doughnuts at lunch and probably dozed off while configuring the user into the system.
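To make the failure mode concrete, here is a minimal sketch of why a lookup scoped to one branch misses an entry filed under another. The DNs, the base names and both helper functions are hypothetical, not the product's real configuration, and the DN comparison is deliberately naive (it ignores escaped commas).

```python
def is_under(dn, base_dn):
    """Naive check that dn sits somewhere below base_dn.

    Assumption: no escaped commas in the DNs; a real client should use a
    proper DN parser instead of splitting on commas.
    """
    parts = [p.strip().lower() for p in dn.split(",")]
    base = [p.strip().lower() for p in base_dn.split(",")]
    return len(parts) > len(base) and parts[-len(base):] == base


def explain_lookup(entry_dn, user_base, group_base):
    """Say where an entry lives relative to the two (hypothetical) branches."""
    if is_under(entry_dn, user_base):
        return "found under the user base"
    if is_under(entry_dn, group_base):
        return "filed under groups, invisible to a user-only search"
    return "outside both bases"


if __name__ == "__main__":
    # Hypothetical directory layout, for illustration only.
    users = "ou=User Accounts,dc=example,dc=com"
    groups = "ou=Groups,dc=example,dc=com"
    print(explain_lookup("cn=jdoe,ou=Groups,dc=example,dc=com", users, groups))
```

A check like this is only a diagnostic aid; the real fix was to move the entry to where it belonged.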
2. Authentication failure
This one involves LDAP too. We recently built a web app that was supposed to check for users under a specific tree node. The logic was to bind to the LDAP server as the user; if the bind was successful, the user must exist and the password must be right. Unfortunately the negative scenario for this logic was not tested all that well. The API did not throw an exception when the bind failed, and the code never checked the result. This meant that I could log in to the application without a password. It also meant I could log in with any user name I wanted, even one that does not exist. I was tempted to do some operations as MrBunnyRabbit76 but fixed the bug instead.
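The fix boils down to failing closed. The sketch below is not the original code: `bind` is a hypothetical callable standing in for whatever the LDAP library offers, assumed here to return True on success and anything else (or raise) on failure. The empty-password check is there because an LDAP simple bind with a name and no password is treated as an unauthenticated bind, which some servers accept.

```python
def authenticate(bind, user_dn, password):
    """Return True only when the directory positively confirms the bind.

    `bind` is a hypothetical callable (user_dn, password) -> bool that
    wraps the real LDAP client.
    """
    # Reject blank credentials before talking to the directory: a bind
    # with an empty password may succeed as an unauthenticated bind.
    if not user_dn or not password:
        return False
    try:
        result = bind(user_dn, password)
    except Exception:
        # Any error from the client counts as a failed login.
        return False
    # Do not rely on "no exception" meaning success; require an explicit True.
    return result is True


if __name__ == "__main__":
    # Stand-in for an API that reports failure silently instead of raising.
    def silent_failure(user_dn, password):
        return None

    print(authenticate(silent_failure, "cn=MrBunnyRabbit76", "carrots"))
```

Whatever the real API looks like, the negative cases (wrong password, empty password, unknown user) deserve their own tests.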
3. FTP and the CPU
Now comes this little gem. An FTP process had been scheduled in cron to run every 10 minutes or so. Its job was to keep a local folder and a remote FTP folder in sync with each other. The program used to do this was lftp with the -R switch (for reverse). It so happened that the FTP account got locked because it ran out of space. The account was unlocked the next day. However, the lftp processes started while the account was locked did not terminate for some reason.
I came back to check on a web application on the server and it gave a 503 error. Hmm… weird, I thought. Everything was fine, Tomcat was up, Apache was up, and yet a 503. The problem was that the lftp processes that never terminated had overloaded the CPU. With cron starting a new one every 10 minutes or so, they had piled up, and all other processes were begging for some CPU time. The lftp processes had been sitting quietly and drinking up precious CPU power for almost 2 days. Once they were killed, things went back to normal.
Weird things happen in PROD whether it is your fault or not. No matter how much you test, it always pays to stay on your guard.
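One way to keep a scheduled job from piling up like this is to wrap it with a lock and a timeout. This is a sketch under assumptions, not what was running on that server: `run_exclusive`, the lock path and the timeout value are all made up for illustration, and `fcntl` limits it to Unix-like systems.

```python
import fcntl
import subprocess
import sys


def run_exclusive(command, lock_path, timeout_seconds):
    """Run command unless a previous run still holds the lock.

    Returns "skipped" if another run is active, "timeout" if the command
    had to be killed, otherwise the command's exit code.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails at once if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return "skipped"
        try:
            # subprocess.run kills the child when the timeout expires.
            completed = subprocess.run(command, timeout=timeout_seconds)
        except subprocess.TimeoutExpired:
            return "timeout"
        return completed.returncode
    # The lock is released when the file is closed on leaving the block.


if __name__ == "__main__":
    # Harmless stand-in command; the real job would be the lftp call.
    print(run_exclusive([sys.executable, "-c", "pass"], "/tmp/sync.lock", 300))
```

In the cron entry, the command would be the sync itself, something along the lines of `["lftp", "-e", "mirror -R /local/dir /remote/dir; quit", "ftp.example.com"]`, where the paths and host are placeholders. With a timeout shorter than the schedule interval, a hung transfer gets killed instead of lingering for two days.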