A tale of five bugs: Django Intranets on Windows
I'm in Ethiopia working with the ATA to help install their Intranet. (Aside: Ethiopia has great, cheap food and beer, and fairly bad Internet.) Because they really want to run the server locally, and they only have Windows servers, we agreed to try to make it run on Windows.
We haven't done many deployments of Django applications on Windows, and it always seems to be a challenge. Integration with Microsoft IIS seems to be a non-starter. Apache works, but getting the right combination of Python, MySQL, Apache and mod_wsgi installed is very challenging. I thought our worries were over when we discovered BitNami djangoStack, which installs a well-tested combination of all the above, and we successfully deployed a project on this stack.
So I didn't worry much about how to get our newest application running on Windows. How wrong I was. This little jolly turned into a two-day exercise in frustration that pushed all my Python, C, Java, Windows and SQL skills to the limit.
- Learning 1
- Never assume that deployment on a new platform will be easy. Especially on Windows. Software deployment is hard.
I had some problems with our deployment script, which is supposed to make deployment easier by scripting the steps necessary to install and configure the applications' dependencies. This is very useful for allowing users without Django expertise to deploy applications, because it reduces the number of steps required, but in many places the script assumed a Linux environment. I think I've fixed all of these now.
One of the tasks of this script is to create a virtualenv which allows an application's Python dependencies (Python extension libraries) to be installed in a subdirectory, where they can easily be removed and reinstalled. This even works for libraries that require compilation, if you have a compiler installed. All our Linux servers have gcc installed by default, so this hasn't been a problem for us, except for the above-mentioned first Windows deployment. At that time, we discovered some binary packages that could be installed on the Windows host system, outside the virtualenv, to provide these features. But they didn't include everything that our Intranet needs, so I had to add PyWin32 and PIL to that list.
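One Linux assumption that commonly bites scripts like this is the layout of the virtualenv itself: on Windows the executables live in a Scripts directory rather than bin, and carry an .exe suffix. A minimal sketch of a platform-aware lookup follows. The helper name venv_executable and its parameters are hypothetical and not taken from our deployment script.

```python
import os
import sys


def venv_executable(venv_root, name, platform=sys.platform):
    """Return the path of an executable inside a virtualenv.

    Hypothetical helper: virtualenv puts executables in 'Scripts' on
    Windows (with an .exe suffix) and in 'bin' everywhere else.
    """
    if platform.startswith('win'):
        return os.path.join(venv_root, 'Scripts', name + '.exe')
    return os.path.join(venv_root, 'bin', name)
```

A deployment script can then call venv_executable(root, 'pip') instead of hard-coding root + '/bin/pip'.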
Then I discovered that our migrations wouldn't run. There are several bugs with migrations on MySQL:
- Renaming a column which is a foreign key: "MySQL doesn't bother to re-point the foreign key constraints when you change a column name."
- Making a foreign key nullable after creation: MySQL 5.5 broke backwards compatibility; the fix is in South 0.7.4.
- You also can't rename a column that has a UNIQUE key on it, and apparently can't remove the UNIQUE key either, because South complains that it doesn't exist.
I ended up having to completely abandon the migrations in one of the apps that makes up the Intranet. Luckily nobody's using it yet, so it won't cause problems.
The second issue was with the Apache Tika integration. I already spent several days working out how to integrate a JVM into the Python server process directly. I decided to abandon that approach for Windows, because I didn't want to deal with the issues of compiling custom DLLs and embedding JVMs on a platform that I know less about, including potential threading conflicts between Apache, Python and Java in the same process.
Tika has a standalone server mode with a web service API which turned out to be fairly easy to call from Python:
    from httplib import HTTPConnection
    conn = HTTPConnection('localhost', 9998, True, 30)
    buffer = open(path) # wrong! see below
    conn.request('PUT', '/tika', buffer)
    response = conn.getresponse()
    if response.status != 200:
        raise Exception("Unknown response from TIKA server: %s: %s" %
            (response.status, response.reason))
    return response.read()
This server runs as a stand-alone application, and unlike Unix, Windows makes it fairly hard to set up such an application to be started at boot time, with no users logged in to the server. I had assumed that for a Java application, this was a solved problem, but I was wrong. As usual, Stack Overflow was very helpful, turning up a bunch of suggestions, several of which I tried and failed to implement:
- srvany.exe abandons the child process, which means that you need to use Task Manager to kill it if you want to restart it. It remains an option, but not ideal.
- sc.exe is bundled with Windows XP, but also abandons the child process.
- Java Service Wrapper is not free for 64-bit Windows, and I didn't want to become dependent on a 32-bit JVM because that might not be possible in a future deployment, but it remains an option, if the 32-bit version runs on 64-bit Windows (not tested). Also, "the configuration file can get a little crazy"!
- YAJSW aims to clone JSW without the license restriction, but I couldn't understand how to use it from the documentation.
- Apache Commons Daemon requires launching a separate process to tell the server process to shut down. Tomcat supports this with its control port listener, and so does plain Jetty, but Tika's embedded Jetty doesn't provide such a listener, and I didn't fancy writing one.
- WinRun4J looked ideal, and ran the application absolutely fine on the command line.
- JavaService, apart from its obnoxious "free registration required" to download, was really difficult to work out how to use with the limited and broken documentation. When the command-line tool failed to accept any of the (documented) arguments I tried to pass to it, I gave up.
- I didn't like the idea of writing a service wrapper in Perl or .NET and adding another dependency.
- WinSW is written in .NET, so I didn't investigate further.
- FireDaemon would probably work, but it's not free.
- JSL (Java Service Launcher) would probably work too. I couldn't find the documentation when I first looked at it.
- JSmooth might also work, but I missed it the first time.
At this point I was left with only WinRun4J as an option. I discovered that you have to write a service class, using their JAR, to run your application as a service, so I wrote one. I had to bodge the classpath to compile it, since Tika uses Maven and WinRun4J isn't available in Maven repositories.
Then I wasted an hour with an undocumented problem when I added the service.class to my winrun4j.ini without removing the main.class entry, causing this misleading error message:
[err] Could not find service class java.lang.NoClassDefFoundError: org.apache.tika.server.TikaServiceWinRun4J
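A check like the following would have caught this before starting the service. It is only a sketch: it assumes the INI file is a flat list of key=value lines, it treats lines starting with # or ; as comments (an assumption about the format), and both function names are hypothetical.

```python
def parse_flat_ini(text):
    """Parse key=value lines into a dict, skipping blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: '#' and ';' introduce comments; lines without '=' are ignored.
        if not line or line.startswith(('#', ';')) or '=' not in line:
            continue
        key, value = line.split('=', 1)
        settings[key.strip()] = value.strip()
    return settings


def conflicting_entry_points(settings):
    """True when both main.class and service.class are configured."""
    return 'main.class' in settings and 'service.class' in settings
```

Running conflicting_entry_points(parse_flat_ini(open('winrun4j.ini').read())) before installing the service would flag the combination that produced the error above.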
After solving that one, I came to a problem that I couldn't solve:
SEVERE: Can't start
java.lang.NullPointerException
    at com.sun.jersey.core.spi.scanning.PackageNamesScanner$ResourcesProvider$1.getResources(PackageNamesScanner.java:170)
    at com.sun.jersey.core.spi.scanning.PackageNamesScanner.scan(PackageNamesScanner.java:135)
    at com.sun.jersey.api.core.ScanningResourceConfig. (ScanningResourceConfig.java:80)
    at com.sun.jersey.api.core.PackagesResourceConfig. (PackagesResourceConfig.java:104)
    at com.sun.jersey.api.core.PackagesResourceConfig. (PackagesResourceConfig.java:78)
    at TikaServiceWinRun4J.serviceMain(TikaServiceWinRun4J.java:83)
[info] Service method completed...
I spent much time wondering how to debug this, whether I could attach a debugger to a WinRun4J process while it was still starting up, etc. I gave up at that point and went home, still turning the problem over in my mind. Then it occurred to me that the Tika web service is implemented using what appears to be a standard Java servlet, and if I could install that servlet into a servlet container like Tomcat, which has its own service integration with Windows, that would solve the startup problem. And it did, with the help of the Maven WAR plugin which made it trivial. (At times like this I love Maven; at other times I hate its complexity.) I submitted a patch to the Tika developers in the hope that this will be adopted as a standard deployment method for Tika server.
Along the way, I discovered that the standalone Tika Server (command-line JAR) is pretty badly broken: it appears to work, but it can only extract content from Ogg Vorbis files and nothing else! I didn't manage to fix this properly, but I did come up with a workaround, described in that ticket, which disables the Ogg Vorbis support and re-enables support for other formats that are more important to us, like Microsoft Word and Excel.
So finally, I was able to run the Django application under Apache on Windows, and log in and upload documents. But wait, what's this? Error 500 from Tika? WHY?!
Tika wasn't logging anything useful, so I had to fire up Wireshark to inspect the conversation between the client (the Intranet application) and the server (Tika). And thanks to Microsoft's infinite wisdom in removing the loopback interface when they ripped their TCP/IP network code out of BSD, it's not possible to listen for loopback traffic on Windows, so I had to point the Intranet on Windows at a Tika server on another machine to capture the traffic.
Then I discovered that httplib was telling the server that it was going to send 7168 bytes, but only sending 6. Tomcat waited 20 seconds for the rest, and then timed out the request, returning an Error 500 as I saw earlier, and helpfully not logging anything. So why was httplib not sending all the data? I stepped into it with a debugger and discovered a problem with one of the lines I wrote above:
buffer = open(path)
On Unix this works fine, but on Windows it opens the file in text (ASCII) mode by default, which stops reading at the first ^Z (hex 1A) character in the file. And guess what's at byte 7 of every Excel XLS file?
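A corrected version of the call might look like the sketch below. It is written for Python 3, where httplib is named http.client. The function names and the optional conn parameter (added so that the request logic can be exercised without a running server) are illustrative and not part of the original code. The essential change is the 'rb' mode.

```python
from http.client import HTTPConnection  # 'httplib' in Python 2


def read_document(path):
    """Return the file's raw bytes.

    'rb' opens the file in binary mode, so no text-mode handling
    (such as stopping at ^Z on Windows) is applied to the contents.
    """
    with open(path, 'rb') as handle:
        return handle.read()


def extract_text(path, conn=None):
    """PUT a document to a Tika server and return the response body."""
    if conn is None:
        # Host and port as in the original snippet.
        conn = HTTPConnection('localhost', 9998, timeout=30)
    conn.request('PUT', '/tika', read_document(path))
    response = conn.getresponse()
    if response.status != 200:
        raise RuntimeError("Unknown response from TIKA server: %s: %s"
                           % (response.status, response.reason))
    return response.read()
```

Because the whole body is passed as bytes, the Content-Length header that the client computes matches the number of bytes actually sent.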
And so onto the fifth bug. Although the intranet appeared to work fine, it would sometimes pop up a scary error message on the server:
Program: C:\PROGRA~1\BITNAM~1\apache2\bin\httpd.exe R6034 An application has made an attempt to load the C runtime library incorrectly.
Past experience told me this was probably due to one of the Python extensions missing an embedded manifest, which would be required to tell Windows which version of the Microsoft C Runtime it's supposed to use. However I had great difficulty tracking down the culprit. A bug in the PIL installer turned out to be a red herring, and while we've seen this problem in LXML in the past, it wasn't there this time.
Eventually, after much soul-searching, I downloaded Process Monitor and configured it to monitor all file activity from httpd.exe processes. I found out that Apache was loading the C runtime DLL from C:\Program Files\BitNami DjangoStack\Apache2\Bin, which is a strange place to find it, because normally it would be installed in C:\Windows\WinSxS. I discovered manifest files in that directory that might confuse Windows into loading this copy instead. That can potentially end up with multiple copies of the DLL loaded, which can result in random crashes, but in this case I think it was detected early.
Removing the Microsoft.VC90*.manifest files from that directory didn't solve the problem completely, because there was another copy of them in C:\Program Files\BitNami DjangoStack\Python. But with those second copies removed as well...
The intranet now runs on Windows!
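For anyone hunting the same problem, stray manifests like the ones described above can be listed with a short script. This is a sketch only: the function name find_vc90_manifests is hypothetical, and it deliberately lists files rather than deleting them, so each one can be reviewed first.

```python
import fnmatch
import os


def find_vc90_manifests(root):
    """Yield paths of Microsoft.VC90*.manifest files anywhere under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if fnmatch.fnmatch(name, 'Microsoft.VC90*.manifest'):
                yield os.path.join(dirpath, name)
```

Pointing it at the stack's install directory, for example find_vc90_manifests(r'C:\Program Files\BitNami DjangoStack'), should turn up copies in both the Apache and Python subdirectories if they are present.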