Friday, September 21, 2012

More Than One-Fourth of the Puzzles Litzed, Poll, and Litzing Methods Using OCR

I am thrilled to announce that, as of this week, we reached and passed the 4,000 mark and have now finished more than a quarter of the entire project (4,056 puzzles)!  This is amazing—I never imagined we'd be this far along in the project so early on.  We're also rapidly approaching 1982, with just six more months of puzzles to send out from 1983!

In other news, I've added a poll feature to the website.  The first poll asks how long it takes to litz a daily puzzle.  Several responses have already come in; if you'd like to respond before the poll expires next week, you can find the poll under the litzing thermometer to your right.  I'm planning to post a new poll every week in between my longer regular blog posts.

Litzers especially may be interested in the following:  Joe Cabrera and Martin Herbach, two of our litzers, have come up with innovative methods for litzing puzzles with optical character recognition (OCR) software!  Though the final puzzles need to be proofread for obvious mistakes (such as "A Tunrier" instead of "A Turner," for TED), the OCR methods seem to be quite fast and accurate.  Joe reports that it takes about an hour and fifteen minutes for him to litz an entire batch of puzzles, including the Sunday puzzle; Martin estimates that he takes just forty-five minutes (over several sittings) to litz a batch with his method!  I've summarized both methods below, paraphrasing slightly to make them less technical:

Joe Cabrera's Method

1.  Create new Across Lite text file templates based on the dates and authors' names.

2.  Pull each PDF into Photoshop (assuming everything was originally scanned at 300 dots per inch).  Rearrange all the columns [answer grid and clues] into one long one, with the answer grid on top. Delete the lines in the answer grid to leave just the letters.

3.  Run each Photoshop file through the online OCR reader to turn it into plain text.


4.  Run a script to remove all the clue numbers and use them to separate the clues so they're not all one big paragraph.  Also use the script to clean up extra spaces and garbage, format underscores and ellipses properly, make all quotes "dumb," and so on.  Proofread the clues.

5.  Clean up and proofread the answers.

6.  Cut and paste the proofread clues and answers into the Across Lite text files created earlier.

7.  Drag the text files into Across Lite and save them as standard .PUZ files.
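
The cleanup in step 4 of Joe's method could be sketched in Python roughly as follows.  This is only an illustration, not Joe's actual script: the function name clean_clues is made up, and the sketch assumes the OCR output arrives as one long run of text in which each clue is preceded by its one- to three-digit clue number.  A clue that itself contains a small number (say, "One of 12 tribes") would be split in the wrong place, which is one more reason the clues still need proofreading.

```python
import re


def clean_clues(ocr_text):
    """Split raw OCR text into a list of clues and normalize punctuation.

    Hypothetical helper: assumes every clue is preceded by its clue number.
    """
    text = ocr_text

    # Make all quotes "dumb" (straight) instead of curly.
    for smart, dumb in {"\u201c": '"', "\u201d": '"',
                        "\u2018": "'", "\u2019": "'"}.items():
        text = text.replace(smart, dumb)

    # Format ellipses as three plain periods ("…" or ". . ." become "...").
    text = text.replace("\u2026", "...")
    text = re.sub(r"\.(?:\s*\.){2,}", "...", text)

    # Format fill-in-the-blank underscores as exactly three underscores.
    text = re.sub(r"_(?:\s*_)+|_", "___", text)

    # Clean up extra spaces and line breaks left over from the OCR.
    text = re.sub(r"\s+", " ", text).strip()

    # Remove the clue numbers and use them to separate the clues.
    parts = re.split(r"(?:^|\s)\d{1,3}\.?\s+(?=\S)", text)
    return [part.strip() for part in parts if part.strip()]
```

Each item in the returned list is one clue, ready to be proofread and pasted into the Across Lite text file.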


Martin Herbach's Method

1.  Use an online OCR reader to convert from PDF to rich text format (RTF).

2.  Copy and paste the clues from the RTF file into a Notepad document.

3.  Manually fix missing numbers, broken lines, accented characters, ellipses, and fill-in-the-blank clues.  Save as a text file.

4.  Read the text file into Microsoft Excel, specifying "Space Delimited" and "No Text Indicator."

5.  Delete Column A (which consists of the clue numbers).  Save the Excel file.

6.  Upload the Excel file to Google Docs and then download the Excel file from Google Docs as a text file (because Excel screws up quoted strings and commas by inserting extra quotes, which Google Docs doesn't).

7.  Read the text file (actually a .tsv file for tab-separated values) into Microsoft Word.

8.  Replace all tabs with spaces.  Select the entire Word document and copy it.

9.  Paste the contents of the Word document into an Across Lite text template.
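
For litzers comfortable with a little scripting, steps 4 through 8 could in principle be collapsed into a few lines of Python.  The sketch below is not part of Martin's method: strip_clue_numbers is a hypothetical helper, and it assumes each line of the corrected text file from step 3 holds one clue preceded by its number and a space.  It drops that first "column" and leaves quotes and commas untouched, so no round trip through a spreadsheet is needed.

```python
def strip_clue_numbers(lines):
    """Remove the leading clue number from each line of corrected clue text.

    Hypothetical helper: assumes one clue per line, in the form "12 Clue text".
    """
    clues = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split off the first space-delimited field (the would-be Column A).
        first, _, rest = line.partition(" ")
        if first.rstrip(".").isdigit() and rest:
            clues.append(rest.strip())  # drop the clue number
        else:
            clues.append(line)  # no number found; keep the line for proofreading
    return clues
```

The resulting list can be joined with line breaks and pasted into the Across Lite text template, as in step 9.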

The only minor flaw in these methods (other than that litzers must be very tech-savvy!) is that the OCR service Joe and Martin use allows only fifteen file conversions per hour.  This isn't a huge problem, though, since each batch contains only seven puzzles.  Another very interesting idea Joe tested was speaking the answer grid into his smartphone, though he found that doing this actually took more time overall.  Nevertheless, the ingenious methods both Joe and Martin have come up with to automate the litzing process are very, very cool!

Today's featured pre-Shortzian New York Times puzzle was constructed by I. Judah Koolyk.  It was originally published on October 8, 1983, and was recently litzed by Mark Diehl.  This amazing daily puzzle contains twelve rebuses of STAR!  My favorite theme entries are **** RATINGS (yes, four rebus squares in a row!) and *** GENERAL.  Even today, stacked rebuses with this many rebus squares are extremely challenging to construct.  Not surprisingly, the nonthematic fill (consisting of fewer entries than usual) has a handful of undesirable entries (ESNES, ITEAS, etc.).  All things considered, though, this is an admirable construction that feels way ahead of its time!  The answer grid (with highlighted theme entries) can be seen below:


Today's featured pre-Shortzian entry is NEBEL.  According to the Ginsberg database, NEBEL has never been reused in a Shortz-era puzzle.  NEBEL originally appeared in the June 2, 1985, puzzle by Joy L. Wouk, which was recently litzed by Stephen Edward Anderson.  The original clue for NEBEL was "Ancient stringed instrument."  Webster defines a nebel as a variant of nabla, which it lists as "an ancient stringed instrument, probably like a Hebrew harp of 10 or 12 strings."  Not surprisingly, nebel comes from nēbhel, the Hebrew word for "harp."  Below is one interpretation of what a nebel might have looked like:


Image courtesy of the Potsdam Public Museum.

2 comments:

  1. OCR? Why?! All the fun of litzing is looking over the grids and checking out the clues!

  2. Oh, believe me, the OCR is not perfect, and the answer grids don't scan as well as the clues; you still have to look everything over. But it sure cuts down on a lot of typing and I'll bet it still results in less mistakes.
