Friday, February 8, 2013

Labels, Over 7,600, Howard Barkin's OCR Method, Funny Typos, and New Poll

Important news:  I've added a new feature to the blog—labels—to make finding posts about particular subjects or people easier.  The labels are at the bottom of the column to your right.  If you want to read all the posts that include the label "XWord Info," for example, click on "XWord Info" in the labels field, and all those posts will appear.

I can hardly believe that we've now litzed more than 7,600 puzzles!  We're rapidly approaching our goal of the halfway point—8,113 litzed puzzles—by the end of February.  Awesome job, everybody!

Some of you have noticed the prodigious number of puzzles Howard Barkin has been litzing lately—nearly 400 since the beginning of January!  Howard has devised a very efficient way of litzing using optical character recognition (OCR) technology.  Back in September, I wrote an OCR post that described the methods of two other litzers—Joe Cabrera and Martin Herbach—who were using this technology to litz.  Now Howard has generously taken the time to present the details of his OCR method in response to some questions I asked him.  Read on:

You've litzed an amazing number of puzzles in a very short period of time using a combination of OCR and manual entry.  With this process, litzing a week of puzzles was taking you about 90 minutes when you started.  Has the process become even faster since then?

First, OCR = "optical character recognition," aka a program designed to analyze printed characters and convert them to text, which can then be electronically edited.  (Whew.)

Now, with solving and litzing experience comes more efficiency.  Depending on the quality of the original files, a week of puzzles can be accurately scanned and litzed (converted electronically) in the range of 45–90 minutes.  That depends upon how readable the text is, how big the Sunday puzzle is, whether the baby wakes up, etc.

It's been fun to use my solving and technical experience to figure out the most efficient way to improve the process—remember, improvement is not about speed but about transferring these puzzles accurately.  That is the best way to honor the constructors and editors and ultimately complete the project.

You've mentioned using OCR to scan the PDF clue columns and Word macros to copy and paste individual clue text into Crossword Compiler.  Can you describe this process in detail in case other litzers want to give it a try?

I can try.  Again, this is about getting this project done efficiently and at the best possible quality.

Never sacrifice quality for speed.  Also, this is for PC-compatible use.  Mac users likely have similar apps available.  Use what works.

Have the following stuff (technical term . . .) available:
  • A text editor with macro recording capability.
  • Crossword Compiler or your favorite puzzle editor, of course.
  • Your preferred OCR program, which you should already have used to scan the clues of several PDF or image puzzle files.  In each file, look carefully for the patterns in the scans that result in incorrect text.  Each OCR program may produce different quirks.  This takes time and experience.  Be patient, grasshopper.

O.K.  Here we go:

1.  Find the common issues resulting from your OCR scans.  For the OCR app I use, each clue is often separated by extra blank lines, fill-in blanks may scan as "---", letter T as 'I' surrounded by single-quotes, etc.

2.  Once you see such patterns, record macros in your editor to find and replace each instance of the wrong scans with the right ones.  I then save these macros.
  • Use the macro editor (in Word, this is the Visual Basic macro editor) to combine the macros into one master macro, and assign this macro to a keyboard shortcut key.

3.  Ensure that the patterns you are finding and replacing do not create "false positives."  For example, if you see that lowercase Ls are sometimes scanned as number 1s, don't simply replace every 1 with a lowercase L.  You'll create as many problems as you fix.  Care is the key word here in selecting which text to replace.  Wildcards can be helpful as well.

4.  To save extra time, create some additional macros for other frequent data-entry tasks.  I use other macros to enter fill-in blanks "___" where missing and to create end-of-clue ellipses in the New York Times style (". . . ").  It doesn't sound like much, but over the course of many puzzles it saves time (and your wrists!).

5.  After scanning all of your puzzle clues (this may take one or multiple scans, depending on the file), run your master cleanup macro on the file.

  • At this point you have a cleaner text file with many of the common scan errors corrected.

6.  You still need to manually compare and edit each clue to match the actual clues.  Some manual tasks:
  • Add missing fill-in blanks, letters, words, and ellipses, and fix incorrect letters.
  • Style corrections:  adding missing spaces and rejoining words that break across lines with a dash (such as "dis- [new line] enchanted"); the Times puzzles used dashes with impunity, so you have to make some judgment calls on these, but usually it is clear.
  • I also use a handy list of Alt+<number> shortcut codes to type in special accented characters (é, à, etc).  In Windows, the "charmap" program lets you copy these as well.  This helps with those pesky foreign word and name entries.

7.  While editing, manually right-justify all clue numbers (or remove them manually).  If you right-justify the numbers as shown below, you can then use the editor's block-select feature to highlight only the number columns on the left and delete them in one action:
   7
   8
   9
 10

8.  In the crossword editor, use the Import function and select the text file.  (CC8: Clue->Import.)
  • If the file contains exactly one line per clue, all clues should import nicely into the clue review window.
  • If not, an error message will display; then you must edit the text file again and compare to the clues.  Sometimes the initial scan will omit a clue, which you must add in manually.

9.  Review each clue one more time manually and edit as needed.  Sometimes a lowercase L will scan as the number 1, or zero as letter O, which is easy to miss.

10.  Save the file.  You're done!
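
For readers who would rather script the cleanup than record Word macros, the "master cleanup macro" idea from steps 1 through 5 might be sketched in Python roughly as follows.  This is only an illustration, not the actual macros described above:  the function name clean_ocr_text and the three rules are assumptions modeled on the quirks mentioned in step 1, and your own OCR program will need its own rules.

```python
import re

# Hypothetical cleanup rules, modeled on the quirks mentioned in step 1.
# Every OCR program misreads text differently, so treat these as placeholders.
RULES = [
    # A fill-in blank scanned as a run of three or more hyphens.
    (re.compile(r"-{3,}"), "___"),
    # The letter T scanned as an I wrapped in single quotes.
    # (Risky if a real clue quotes the word 'I', so review the results.)
    (re.compile(r"'I'"), "T"),
    # The digit 1 becomes a lowercase L only when it sits between two
    # lowercase letters, so real numbers such as "1 a.m." are left alone.
    (re.compile(r"(?<=[a-z])1(?=[a-z])"), "l"),
]


def clean_ocr_text(raw: str) -> str:
    """Drop blank lines and apply every cleanup rule to each remaining line."""
    cleaned = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue  # extra blank lines between clues are discarded
        for pattern, replacement in RULES:
            line = pattern.sub(replacement, line)
        cleaned.append(line)
    return "\n".join(cleaned)


if __name__ == "__main__":
    sample = "7 He1lo, Do1ly\n\n\n8 --- de France\n\n9 'I'ime of day: 1 a.m.\n"
    print(clean_ocr_text(sample))
```

The third rule shows the point of step 3:  it only touches a 1 that is surrounded by lowercase letters, which is narrower than a blanket replacement and so less likely to create false positives.  The manual comparison in steps 6 and 9 is still needed afterward.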

This is actually a bit easier once you have a rhythm.  In the hour I have before sleeping, I can often finish a set of puzzles this way if the files are clean.  I have had to manually enter a full Sunday puzzle occasionally, though.  Not every file will scan cleanly.

Any questions on details here, feel free to ask.
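
As a further illustration of steps 7 and 8, here is a minimal Python sketch of a check you could run before importing.  It strips the leading clue numbers with a script instead of the block-select approach described above and flags lines that would break the "exactly one line per clue" requirement.  The function name preflight and the expected_count parameter are hypothetical, and the sketch assumes each clue line starts with its number followed by a space.

```python
import re

# A clue line is assumed to look like "  10 Modern art":
# optional spaces, the clue number, at least one space, then the clue text.
CLUE_LINE = re.compile(r"^\s*(\d+)\s+(\S.*)$")


def preflight(text: str, expected_count: int):
    """Strip leading clue numbers and list anything that looks wrong.

    Returns (clues, problems), where clues holds the clue text without
    numbers and problems holds human-readable warnings.
    """
    clues, problems = [], []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        match = CLUE_LINE.match(line)
        if match is None:
            # Often a clue that wrapped onto a second line in the scan.
            problems.append(f"line {lineno}: no leading clue number")
            continue
        clues.append(match.group(2).rstrip())
    if len(clues) != expected_count:
        problems.append(f"expected {expected_count} clues, found {len(clues)}")
    return clues, problems


if __name__ == "__main__":
    scanned = "   7 Verbal thrust\n   8 Foot digit\n  10 Modern art\n"
    clues, problems = preflight(scanned, expected_count=3)
    print(clues)
    print(problems)
```

A mismatch between the expected and actual counts is a hint that the scan omitted a clue or split one across two lines, which is the situation step 8 says you must fix by hand.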


Thanks so much, Howard—this is a very impressive system that has proved to be exceptionally accurate!  If anyone has questions or comments about Howard's OCR method, use the Comment feature below or send me an e-mail, which I'll forward along to Howard.

Speaking of finding mistakes, it's been a while since I've posted funny litzing typos, and my list is now several pages long!  So here are ten of the more humorous ones our proofreaders have caught lately:
  • A clue for DIG was typed as "Verbal trust" instead of "Verbal thrust"
  • A clue for ALFA should have been entered as "Esparto grass" rather than "Esperanto grass"
  • A clue for OPA was accidentally typed as "W.W. III agency" instead of "W.W. II agency"
  • A clue for TOE was supposed to read "Foot digit" instead of "Foot delight"
  • A clue for ALAR should have read "Pteroid" but was accidentally entered as "Period"
  • A clue for BIRDS OF A FEATHER, which was supposed to be "Group in a flock," was erroneously typed as "Group in a frock"
  • A clue for ARNE was typed as "'Rue Britannia' composer" instead of "'Rule, Britannia' composer"
  • The grid entry LET ONE'S HAIR DOWN was mistakenly entered as LET ONE'S TEAR DOWN
  • A clue for ARE was accidentally entered as "Modem art" rather than "Modern art"
  • Last, but certainly not least, a clue for FRED was supposed to read "An Allen who wrote 'Treadmill to Oblivion'" instead of "An Alien who wrote 'Treadmill to Oblivion'"

Today's featured Will Weng–edited puzzle was constructed by Shipp (probably Dorothea E.).  It was originally published on June 22, 1974, and was recently litzed by Mark Diehl.  The first thing that struck me about this puzzle was the shape of its grid, which reminded me of the crosshairs of a rifle.  I didn't think much of it until I noticed the entries SNIPERS and REVENGE in the grid!  I wonder if this was just a coincidence or if the shape/fill was intentional.  Will Weng mentioned somewhere that he published several puzzles constructed by prison inmates (shudder!). . . .  I've just posted a new poll, which asks you to vote on whether you think the puzzle has a rifle-related theme or not.  I'll post a recap next week!

Anyway, the rest of the fill is a mix of great entries and not-so-great ones.  HOBOKEN, LOZENGE, FANATIC, STAUNCH, and CADENCE are fantastic, but SCR (clued as "Movie scenario: Abbr."), API ("Nepal mountain"), and NPS ("Certifiers: Abbr.") are pretty bad.  Nevertheless, this is a remarkably clean "themeless" puzzle!  The solution (with highlighted "theme entries") can be seen below:



Today's featured pre-Shortzian entry, SESQUIDUPLICATE, originally appeared in the July 7, 1976, puzzle by Sidney Nelson, which was recently litzed by Nancy Kavanaugh.  Not surprisingly, according to the Ginsberg database, SESQUIDUPLICATE has yet to be reused in the Shortz era.  The clue for SESQUIDUPLICATE was "Having a ratio of 5 to 2."  This word is so obscure that it isn't even listed in Merriam-Webster's Online Dictionary!  Dictionary.com, however, notes that the word was listed in the 1996 and 1998 unabridged editions of Webster's.  The definition was "Twice and a half as great (as another thing); having the ratio of two and a half to one."  Sesquiduplicate sounds awesome, but I don't think I'll ever be able to use it in a sentence!  Below are some pictures of a sesquiduplicate ratio:


Images courtesy of Linux-Support.com and Wikipedia, respectively.
