Where Are The Wise Men?

Mike's Ramblings

Hoss Dreams of Software


The other night we watched Jiro Dreams Of Sushi, which I highly recommend everyone watch, even if you don’t like sushi. Even if you don’t know anything about sushi. Because it’s not about sushi. It’s about Jiro, an artist who is obsessed with quality and with his craft. And his craft is making sushi.

Jiro Ono is 85 years old and owns a nondescript sushi restaurant in Tokyo. His restaurant only has 10 seats, but it costs $300 per seat and you have to make your reservations at least a month in advance. Oh, and it is a 3-star Michelin rated restaurant. Jiro is, in fact, the oldest chef to be awarded a 3-star Michelin award. The restaurant reviewer interviewed in the film said, many times, that Jiro’s sushi is consistently the best he’s ever had. It’s always the best; there was never a time when it was a bit worse than another. And that is an astounding review.

This all has to do with Jiro, who has committed his entire life to making sushi. Meaning, he’s been at this since he was 14 years old. He’s at his restaurant every day, overseeing the preparation of the fish, rice, eggs, etc. He will quickly give a criticism when he sees or tastes something under his exacting standards, and that includes his own 50-year-old son who works there. Jiro keeps a close eye on his customers, noticing if they are left-handed (he puts the sushi in a different place on the plate if they are) and making slightly smaller pieces for women. He also admits that when his restaurant is closed on state holidays, he doesn’t know what to do with himself.

I’ve been saying for years that cooking is a lot like programming software, and I thought about that many times through this film. Jiro said that, if you want to be the best chef, you can never be satisfied, you must always strive to be better, and you have to love it. These traits, to me, are the same ones that make a great developer. You always have to be learning, striving to make your things better, and you have to love the work. I think the last item is the most important: writing software is hard and takes a certain kind of dedication, nerves, and brain work that, frankly, not everyone is cut out for.

But if you decide that you like this kind of work, then you dedicate your life to it. And, if you want to dedicate your life to it, then you should be constantly looking for ways to get better. Back to Jiro . . . he has been making sushi for 70 years. 70 years! And he is always looking for ways to get better. Not necessarily One Big Thing that will change sushi forever, but little increments, like the kind of rice to use, the temperature of the rice when the sushi is made and served, massaging the octopus for a longer time to bring that much more flavor out of it, finding the best fishmongers to buy from . . . the list goes on and on.

I think most software developers (including myself) want to find the silver bullet, the one thing that will make us all better. But, alas, it doesn’t exist. There is no one methodology to follow, no one language to use, no One True Editor or IDE that solves all the problems. We have to get better a bit at a time.

Really, what I am talking about comes back to craftsmanship. We want to write great software and, after we do that, we want to do it again, but better this time. Never going back, but always improving. Uncle Bob already wrote a great summary of what this looks like so I will just close with telling you to read that. And get started on your personal improvement.

Add a Read-Only Role to Django Admin


I was in a meeting where I was asked to give someone read-only access to the Admin part of our application. That was fine -- it was written in Django and Django has really fantastic Admin functionality. So I assumed that it could handle it, no problem. So I said yes.

Of course, after a little googling, I found that it doesn't support this at all -- you can only give people Add, Change, or Delete permissions. You can make individual fields read-only but, in an ideal world, I needed a whole object to be read-only or not, hopefully determined by Group membership.

My searches didn't give me a lot of hope, but I did find something close [in this post][]. So I expanded it to look for a Group.

So you inherit from ReadOnlyAdmin instead of ModelAdmin for all Admin objects you want to make read-only. Then you also have to add these two properties:

  • user_readonly - list of the fields to be read-only. If you don't put a field in there, the user will be able to change it on the Model!
  • user_readonly_inlines - If you have a related Model that you want to display Inline, then you can't add it to user_readonly because it's not part of the Model. You have to create a read-only InlineAdmin object and list that here.

Creating a read-only Admin object is simple:

from django.contrib import admin


class MyModelInline(admin.StackedInline):
    model = MyModel


class MyModelReadOnlyInline(MyModelInline):
    # Same inline, but "label" is rendered as plain text
    readonly_fields = ["label"]

Then you just list MyModelReadOnlyInline in the user_readonly_inlines and MyModelInline in inlines.

To use the ReadOnlyAdmin:

  • Create an Admin Group called readonly.
  • Add the User to readonly and give them full access to the Models you want them to read -- yes, give them Add, Change, etc. Otherwise they can't view them at all.

When the user logs in, they will see the Model and can go to individual objects, but none of the fields will be form fields -- just plain text.
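The ReadOnlyAdmin class itself isn't reproduced here, so what follows is only a rough sketch of how the group check could be wired up, not the original code. It is written as a plain mixin (the name ReadOnlyGroupMixin and the helper _in_readonly_group are made up for this example) and assumes Django 3.0 or later, where ModelAdmin has a get_inlines() hook next to get_readonly_fields().

```python
class ReadOnlyGroupMixin:
    """Mix into a Django ModelAdmin, listed before ModelAdmin in the bases.

    Hypothetical sketch: the class name and the helper below are not from
    the post; user_readonly and user_readonly_inlines are the two
    properties described above.
    """

    readonly_group = "readonly"   # group name from the steps above
    user_readonly = []            # fields shown as plain text to the group
    user_readonly_inlines = []    # read-only inline classes for the group

    def _in_readonly_group(self, request):
        # Django's User exposes group membership through the `groups` manager.
        return request.user.groups.filter(name=self.readonly_group).exists()

    def get_readonly_fields(self, request, obj=None):
        fields = list(super().get_readonly_fields(request, obj))
        if self._in_readonly_group(request):
            fields += [f for f in self.user_readonly if f not in fields]
        return fields

    def get_inlines(self, request, obj):
        # Assumes Django 3.0+, where ModelAdmin.get_inlines() exists.
        if self._in_readonly_group(request):
            return list(self.user_readonly_inlines)
        return super().get_inlines(request, obj)


# Intended use inside a Django project (not executed here):
#
#     from django.contrib import admin
#
#     class ReadOnlyAdmin(ReadOnlyGroupMixin, admin.ModelAdmin):
#         pass
```

Keeping the check in a mixin means users outside the readonly group still get the normal inlines and editable fields from the same Admin class.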

The Many Roads Of PDF Processing


The Easy Path

So you have a PDF, or a bunch of PDFs, and want to extract the text out of them? A few years ago, this would have been a horrible task, but life has gotten easier since then.

If your PDF is just filled with text, this becomes really easy:

 pdftotext pdfname.pdf

You can find pdftotext for most operating systems.

How do you know that it's just text? If you open it up in Acrobat/Preview/XPDF/etc and can highlight the text, then pdftotext should work fine.

But if you can't do that, then what the author probably did was make an image and embed it in a PDF file. You then have to use OCR, which can give you some output which isn't always right. A Google-sponsored tool called [tesseract][] does a good job with this OCR stuff. I remember that it used to stink, but it doesn't anymore. Simply:

tesseract pdfname.pdf textpat

That will try to do an OCR scan of pdfname.pdf and save each page into a file called textpat.txt.
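The "can I highlight the text?" test can be roughly approximated from a script. This is only a sketch: it leans on pdftotext writing to stdout when the output file is given as "-", and the function names and the min_chars threshold are arbitrary choices made for this example.

```python
import subprocess


def extract_text(pdf_path):
    """Return whatever text pdftotext can pull out of the PDF."""
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def has_text_layer(extracted, min_chars=20):
    """Guess whether the PDF held real text rather than scanned images.

    min_chars is an arbitrary threshold picked for this sketch.
    """
    # Image-only pages tend to yield nothing but whitespace and form feeds.
    return len("".join(extracted.split())) >= min_chars


def needs_ocr(pdf_path):
    """True when pdftotext finds (almost) nothing, so OCR is the next step."""
    return not has_text_layer(extract_text(pdf_path))
```

Calling needs_ocr("pdfname.pdf") requires pdftotext on the PATH; the threshold will need tuning for very short documents.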

But, of course, the path isn't always easy.

The Long and Winding Road

which have to be typed in. Lucky me. We have a scanner on-site and I asked if it does OCR, and I was told that it doesn't. I'm even getting luckier.

But I've parsed PDFs before. I should be able to handle it.

I scanned in a few and had the PDFs sent to me. I installed tesseract via Homebrew. The results were. . . disappointing:

$ tesseract pdfname.pdf out
Tesseract Open Source OCR Engine v3.01 with Leptonica
Unsupported image type.

So a quick google shows that either tesseract doesn't have the right libraries installed, or the PDF wasn't well-formed. Since tesseract told me it found [Leptonica][], I have to assume the proper libraries are there. So our scanner is making improper PDFs. This is great.

After some googling and head scratching, I discovered that tesseract works very well on Tiff files. I used Preview to export the PDF to a Tiff and -- success!

 $ tesseract pdfname.tiff out
 Tesseract Open Source OCR Engine v3.01 with Leptonica
 Page 0
 Page 1
 Page 2
 $ ls  out*
 out.txt

Ok, I didn't want to open all of these files in Preview. How to convert them from the command-line? Well, the first tool to think of is convert from ImageMagick. That has always been a tricky road for me and, sure enough, the resulting Tiff file had horrid resolution. That made tesseract spit out garbage. I searched some more, even for OSX-specific solutions. I found sips, which comes with OSX, but most people haven't heard of it. [The usage is a bit arcane][] but it uses the OSX libraries (i.e. the same thing my Preview export used). And, yes, it worked great out of the box -- except that it doesn't handle multi-page PDFs. Ugh.

How does one break up a PDF into pages? More googling, and I found [pdftk][] which is a little swiss army knife of PDF processing. And, hey, it can break a PDF into pages with the burst option! Or, maybe not:

 $  pdftk pdfname.pdf burst
 Unhandled Java Exception:
 java.lang.NullPointerException
 at com.lowagie.text.pdf.PdfCopy.copyIndirect(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyObject(pdftk)
 at com.lowagie.text.pdf.PdfCopy.copyDictionary(pdftk)

That's not good. A few searches showed someone else with that same problem. The cause? A bad PDF of course! The thing that started me down this path! I could still extract the PDF a page at a time . . . but that felt like a bad option to me.

Ok, time to refocus. I thought, "What am I trying to accomplish?" And that was converting the broken PDFs to Tiffs so I can run tesseract. So let's focus back on the PDF->Tiff part. I did more searching and found [a StackOverflow entry that talked about the problem I had with ImageMagick and tesseract][], where someone posted a nice recipe for using Ghostscript:

 /usr/local/bin/gs -o out.tif -sDEVICE=tiffgray -r720x720 \
     -g6120x7920 -sCompression=lzw in.pdf

And I got a Tiff file out that tesseract could process wonderfully! Woot! The bad part was that tesseract took a long time to process this tif -- much longer than the one from Preview. Most of that processing time was done in the first page of my PDF, which is essentially a cover page. How do I get rid of that cover page? Well, back to pdftk:

 pdftk pdfname.pdf cat 2-end output nocover.pdf

So that makes another PDF from the second page on (these PDFs have a variable number of pages).

Running the PDF->Tiff conversion on nocover.pdf gave some errors. But then I ran tesseract on the resulting Tiff file and I had no problems.

Just for fun, I ran tesseract on the nocover.pdf that pdftk created -- same error as the first time. I figured as much but it was worth a shot.

So, in the end, I wrote a shell script that takes a PDF as a parameter and does this:

# Strip the directory and the .pdf extension: scans/form1.pdf -> form1
name=$(basename "$1" .pdf)

pdf=nocover/$name.pdf
tiff=tiffs/$name.tiff
text=extracted/$name

# Drop the cover page, render to Tiff, then OCR
pdftk "$1" cat 2-end output "$pdf"
/usr/local/bin/gs -o "$tiff" -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw "$pdf"
tesseract "$tiff" "$text"

And that, my dear readers, is how to put a PDF through an OCR process.
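For anyone who would rather drive the same three steps from Python, here is a rough equivalent of the script above. It is a sketch, not a tested replacement: the function names are invented for this example, it calls gs from the PATH rather than /usr/local/bin/gs, and it assumes letter-size pages, which is where 6120x7920 comes from (8.5 x 11 inches at 720 dpi).

```python
import subprocess
from pathlib import Path


def page_geometry(dpi=720, width_in=8.5, height_in=11.0):
    """Pixel size for gs -g: 720 dpi on a letter page gives 6120x7920."""
    return f"{round(dpi * width_in)}x{round(dpi * height_in)}"


def build_commands(pdf_path, dpi=720):
    """Return the three command lines from the script above, in order."""
    name = Path(pdf_path).stem            # scans/form1.pdf -> form1
    nocover = f"nocover/{name}.pdf"
    tiff = f"tiffs/{name}.tiff"
    text = f"extracted/{name}"            # tesseract appends .txt itself
    return [
        # 1. Drop the cover page.
        ["pdftk", str(pdf_path), "cat", "2-end", "output", nocover],
        # 2. Render the remaining pages to a grayscale Tiff.
        ["gs", "-o", tiff, "-sDEVICE=tiffgray", f"-r{dpi}x{dpi}",
         f"-g{page_geometry(dpi)}", "-sCompression=lzw", nocover],
        # 3. OCR the Tiff.
        ["tesseract", tiff, text],
    ]


def ocr_pdf(pdf_path):
    """Create the output directories and run the three steps in order."""
    for directory in ("nocover", "tiffs", "extracted"):
        Path(directory).mkdir(exist_ok=True)
    for command in build_commands(pdf_path):
        # check=True stops at the first command that exits non-zero.
        subprocess.run(command, check=True)
```

One caveat: Ghostscript complained about these PDFs yet still produced a usable Tiff, so if it exits non-zero in that case, check=True may be too strict for the middle step.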


The Road to Scala


To be honest, [Scala][] has been on my periphery for some time now. I had heard of it before, but the first real mention I actually remember was a talk [Ted Neward][] gave at No Fluff one year. I couldn't go to that talk, but I remember him talking about it a few times in some other talks he did that weekend.

Fast-forward to 2010. When I went to [Strange Loop][], there was some buzz on Scala. Of course, Scala was kinda mainstream for Strange Loop by then so there wasn't that much talk on it, but there was buzz. Of course I ignored it.

So, with all that, this is what I knew about Scala:

  • It's statically-typed. Since Python has been my first love, I really can't get into static typing. I see the benefits, but writing code in those languages feels pedantic.
  • It runs on the JVM. I already have Jython as my JVM-alternative of choice.
  • It's kinda functional and kinda OOP. OK, Python is also like that, but that idea weirded me out.

Then we fast-forward to just a couple months ago. I read [this excellent blog post][] and thought he was spot on when talking about the perils of modern day software developers. I honestly know nothing else about Michael Church, but he was spot on in the second part, so how right was he on the first part -- the list of languages?

I already know Python and C. And, OK, not ML and Clojure, but I know what their general idea was. And then there was Scala again. It was this thought that got my attention:

I think Scala is the language that will salvage the 5 percent of object-oriented programming that is actually useful and interesting, while providing such powerful functional features that the remaining 95% can be sloughed away. The salvage project in which a generation of elite programmers selects what works from a variety of programming styles -- functional, object-oriented, actor-driven, imperative -- and discards what doesn't work, is going to happen in Scala. So this is a great opportunity to see first-hand what works in language design and what doesn't.

And I'm all for that -- there are some good parts of OOP, but a lot of it has become painful. All the styles Church listed have some merits as well as downsides. If you can actually do all of them, then the cream of each style should rise to the top.

Another one of his thoughts that grabbed me was:

[Scala] has an incredible amount of depth in its type system, which attempts to unify the philosophies of ML and Java and (in my opinion) does a damn impressive job.

Incredible type system? In a static language? I have yet to see such a beast. OK, the only statically typed languages I have used are Pascal, C, and Java, and not one of them is good.

So, not to lengthen this anymore, I decided to dip more than my toe in the Scala waters and see what all this hype was about. After mucking with it off and on for about a week, I have to say that I'm impressed. I haven't enjoyed discovering a language this much since I started banging on Python over 10 years ago.

I'm far from a journeyman in Scala, but I'm getting up to speed on it rather quickly. When I learn something, I need to be a doer, not a reader. I've been using [Scala Koans][] to play with. It uses [SBT][] to continuously run the tests, which is very cool. When I get to the point of mucking around a little deeper, I use [Scala Test][] with SBT to give me the same continuous feedback.

I recently did [Osherove's String Calculator kata][] to Step 6 in 30 minutes, without any Googling or even too much fumbling. That says something about how easy it can be to get started creating code that actually does something.

Here are some things I have learned to love in Scala:

  • [Pattern Matchers][]. This is probably my favorite. Now that I have grokked them, I may never want to write a parser in anything but Scala ever again. I should also state that I avoid switch-case statements of any kind in any other language, but that structure works really well for Scala's pattern matching. When you use them with regular expression groups, magic happens.
  • [Case Classes][]. They do a lot of the boilerplate of making objects for you, and you get a sane equals to boot. And, as the link says, they go nicely with pattern matchers.
  • The static type system does make sense, and does not annoy me. Say negatives is the result of calling filter on numbers. What is numbers? Well, since we are filtering it, it must be a collection of some sort. Is it a List or is it an Array? Then what is negatives? Well, since we are using filter, it must be the same kind of collection that numbers is. But my favorite part is this: it doesn't matter. I know how negatives should behave, because it should behave just like numbers does. This makes sense to me, so much so that a type declaration for negatives becomes superfluous (hello Java ...)

Now there are things that have annoyed me in Scala. But I'm a beginner so I think some of those things will iron themselves out. I've been coming up with web app ideas that I can start writing in [Lift][], which probably says something about how I feel about learning it.

Slicing Some Python With Emacs


I have a new job and it's quite probable that I will be doing Python for a lot of it. Which suits me just fine.

. . . except that I've been out of the loop for a while. Sure, I have written some Python in the past five years and [some of it has been substantial][] but I feel out of the loop. Most of my simple scripts have been done in good ol' Emacs and bigger projects have been done in [Intellij IDEA][] with [their amazing Python plugin][].

As I started at the new digs, I installed Intellij on my shiny MacBook Pro, turned on Emacs mode . . . and was underwhelmed. I forgot that Emacs mode in Intellij on Mac leaves a lot to be desired -- C-Del for Cut, Alt-P for paste? Ugh. A quick search shows that [I'm not the only one complaining, but it's not fixable.][]

I thought about Emacs and what I would miss about running things in Intellij IDEA. The biggies were:

  • Syntax checking
  • Running unit tests
  • Auto-refactoring (Extract Variable, Method, etc)

These are things that are supposed to separate an IDE from a text editor. However, Emacs is an elegant weapon from a more civilized age. So the hunt is on to see what others have done while I was in hibernation from Python.

I've tried to use the [Rope library][] in the past and found it hard to set up. But I did note that it's still actively developed and so I tried to find some example configs to borrow from. That's when I found Gabriele Lanaro's excellent [emacs-for-python][] collection. It included Rope, [YA Snippet][], and other goodies, all configured to work together in harmony.

I forked it, cloned it, and had a few problems, so I fixed them and Gabriele merged them back in. It still didn't have unit test support, but I found [nosemacs][], which runs [Nose][] on the Python unit tests.

In searching for something else, I stumbled into [virtualenvwrapper][], which is a set of helpers around the most excellent [virtualenv][] utility, which creates a clean environment for Python development. These are used in emacs-for-python, so I put it in as well. [I then stumbled into this post][], which explains how to use the hooks in virtualenvwrapper to control Emacs. Woot!

So now my workflow is like this:

  • Type 'workon something', which will put my prompt in my "clean room" Python environment for the project. My Emacs has also switched to that environment, including using that version of the Python interpreter.
  • In Emacs, type C-c m, which will run and report on all my unit tests in my current module.
  • In Emacs, type C-c r to extract a new variable. Other commands exist for class, method, etc.
  • Type deactivate and my prompt moves away from my clean room, and my Emacs leaves too.
  • When I go back to work on something, Emacs will remember the last buffers it worked on.

I put all these changes into my branch of emacs-for-python, and Gabriele has already pulled them in. They are available in HEAD on [emacs-for-python][].