Never trust an environment you do not fully control

This is usually a suggestion you hear when it comes to sensitive information, personal data and security-related tasks. I learned it again over Christmas while working on a small side project and giving a service a friend of mine is working on a try.

Before we start, let me warn you: do not take this as a guide on how to troubleshoot a problem. It is not a good idea and you should never do it this way if you are working on a serious project. What I did here was a mix of laziness and not being at home for 4 days - we are talking about Christmas, so I was running from one family event to the next and time to troubleshoot unimportant stuff was pretty rare. I had 2 or 3 tries a day without sacrificing any family time, but I really wanted to know if it works.

The service I was trying sounded pretty nice - a kind of Heroku alternative, based on what you would expect (Docker etc.), but with the difference that you bring your own API keys for a cloud provider and all the servers belong to you - they just provision them and charge you a few dollars a month for the service, which means you get a lot saner pricing. I was told everything is documented and there would never be any weird behavior, since it is just one process running and not doing much.

I hope when she reads this she will not be too mad, because I likely messed up her pitch and made it sound a lot lamer and more useless than it does when she pitches it herself. Since the service does not even have a name or anything yet, I also cannot link you to the obligatory big-header-no-content marketing site. I will let you know when there is one.

She set me up with a test environment and I requested a 512MB Digital Ocean box for "workers". Everything was pretty painless: pushing to the service's git endpoint worked and some minutes later I got a "deployment succeeded" notification. The weird stuff began when my management command ran out of memory.

The management command is quite trivial. It iterates over all keys in an S3 bucket, checks whether a key has changed and, if so, downloads the content, unzips it and stores it in the database.
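
To make that a bit more concrete, here is roughly the shape of such a command - a sketch, not the actual code: the bucket name, the gzip assumption and the store_content helper are made up, it assumes the boto 2 API of that era, and logfile_changed is the method shown further down.

    import gzip
    import io

    import boto
    from django.core.management.base import BaseCommand

    class Command(BaseCommand):
        def handle(self, *args, **options):
            conn = boto.connect_s3()                        # credentials come from env/config
            bucket = conn.get_bucket('example-log-bucket')  # made-up bucket name

            for key in bucket.list():
                # name and etag come straight from the listing,
                # no download is needed to detect a change
                if not self.logfile_changed(key.name, key.etag):
                    continue

                raw = key.get_contents_as_string()          # fetch the whole object into memory
                text = gzip.GzipFile(fileobj=io.BytesIO(raw)).read()  # assuming gzipped logs
                self.store_content(key.name, text)          # hypothetical "store it in the DB" step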

The script takes quite a bit of time for the full production dataset (roughly 26 hours), so I only used 2% of the dataset as test data. The code is fully covered by tests that use real data, so I didn't expect too many stupid errors to make it to the test server. One of the reckless assumptions that turned out to be true - lucky shot :)

This was the first time I used boto with lots of keys and in-memory fetching and unzipping, so I made a quick list of my first thoughts on what could possibly be wrong.

My first attempt was to run the script a few times and look at the key it was failing on. It was always a different one, and the one that failed on the second run passed on the first one.

Next I just tried running it from top to bottom, always doing only one thing at a time. The first test was to iterate over all keys and print the name and etag, which worked fine and was even pretty fast compared to the expected runtime.

Next I fetched all keys and checked if they had changed - and the script was killed, again. So my second theory was not even worth testing; something broke way earlier.

    def logfile_changed(self, name, etag):
        # look up (or create) the record for this key
        logfile, created = LogFile.objects.get_or_create(
            website=self.website,
            name=name,
        )

        # brand new key: remember the etag and treat it as changed
        if created is True:
            logfile.hashed = etag
            logfile.save()
            return True

        # known key: it changed if the stored etag differs
        if logfile.hashed != etag:
            logfile.hashed = etag
            logfile.save()
            return True

        return False

Quickly running it with everything after get_or_create commented out showed that it was still crashing. To get boto out of the picture I replaced it with a for loop:

    for x in range(1, 9999999999):
        self.logfile_changed(x, x)

Guess what - out of memory. So we have a for loop which does nothing but get_or_create and still runs out of memory. Anyone who has used Django for more than the duration of a bootcamp will likely scream "set DEBUG to False, you idiot". But guess what - DEBUG was set to False. At least I thought it was.
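
For anyone who has not run into it: with DEBUG = True, Django records every SQL query it executes in connection.queries, and in the Django versions of that time there was no cap on that list, so a long-running loop that fires queries eats memory query by query. A quick way to watch it from any Django shell (the raw cursor is just to keep the example self-contained):

    from django.db import connection, reset_queries

    cursor = connection.cursor()
    for x in range(1000):
        cursor.execute("SELECT 1")    # any query will do

    # with DEBUG = True this prints 1000, with DEBUG = False the list stays empty
    print(len(connection.queries))
    reset_queries()                   # the manual way to drop the accumulated log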

But from here it was pretty obvious - it had to be the debug flag. Local testing didn't uncover any problems, my server at home completed the import of the full production dataset without the Python process going above 150MB, and suspecting that Django is broken is nearly as good a guess as suspecting a compiler bug.

After a short mail describing the problem it turned out that no matter what environment variables I set, the system just refused to overwrite the ones set by the service, and somehow everything ends up in one big pile of environment variable mess. Please do not ask me how they do it, I was too scared to ask. In this case they use the presence of a DEBUG variable, too, to activate debug mode - which actually explains why the script was running out of memory. Hardcoding DEBUG = False in settings.py fixed it. It was a bit troublesome to figure out what was happening, but at least she got a real-world example of why it is a bad idea to toy with the customer's environment without documenting it or showing warnings.
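
The fix itself is as boring as it sounds - roughly this, with the commented-out line being my guess at how the environment-driven version looked before:

    # settings.py
    # before (roughly): whatever DEBUG value the platform injects wins
    # DEBUG = os.environ.get('DEBUG', 'False') == 'True'

    # after: immune to whatever is in the environment
    DEBUG = False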

The lesson is the same as always: if you do not have full control over an environment - in this case it was not even possible to get a shell or anything like that - you are well on your way to a world of pain when you run into a problem. Even if you think you know and understand the problem, you will eventually not be able to troubleshoot or fix it by yourself. While this bug was relatively easy, there are harder ones out there you can run into. But hey, obscure bugs are the fun ones :)

>> posted on Dec. 30, 2014, midnight in backend