OCR of tables (e.g. WTTs)

OCR of tables (e.g. WTTs) 12/01/2023 at 14:18 #150131
DonRiver
151 posts
Was wondering if anyone's had a go at using OCR to parse scanned timetables, e.g. those in Network Rail's archive?

Just looking at Tesseract OCR's documentation (tesseract-ocr.github.io): it's designed for reading paragraphs of text, not tables. Wondering if there are off-the-shelf image processing techniques for recognising each column by its borders, cropping it out of the image, and OCR'ing it in isolation… it _might_ not actually be difficult in Python
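
For what it's worth, the column-splitting step can be sketched without any imaging library. This is only a rough sketch under assumptions: the page has already been binarised into rows of 0/1 pixels (1 = dark), the table has ruled vertical borders, and the scan is not skewed. The names find_column_spans, crop_columns and min_line_fraction are made up for illustration. It treats any pixel column that is mostly dark as a border and returns the x-ranges in between; each crop would then be passed to the OCR engine on its own.

```python
def find_column_spans(binary, min_line_fraction=0.8):
    """Return (start, end) x-ranges lying between vertical ruling lines.

    binary: list of rows, each a list of 0/1 values where 1 is a dark pixel
    (assumed input format). A pixel column counts as a ruling line when at
    least min_line_fraction of its pixels are dark (assumed threshold).
    """
    height = len(binary)
    width = len(binary[0]) if height else 0

    # Vertical projection profile: number of dark pixels in each pixel column.
    dark_counts = [sum(row[x] for row in binary) for x in range(width)]
    is_line = [count >= min_line_fraction * height for count in dark_counts]

    spans = []
    start = None
    for x, line in enumerate(is_line):
        if not line and start is None:
            start = x                  # first pixel after a border
        elif line and start is not None:
            spans.append((start, x))   # reached the next border
            start = None
    if start is not None:
        spans.append((start, width))   # table has no closing border
    return spans


def crop_columns(binary, spans):
    """Slice each (start, end) span out of every row, giving one image per column."""
    return [[row[a:b] for row in binary] for a, b in spans]
```

On a real scan the borders are several pixels thick and rarely perfectly vertical, so the threshold would need tuning and the page would probably need deskewing first.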

(named for the one in Tasmania, not in Russia)
OCR of tables (e.g. WTTs) 12/01/2023 at 16:08 #150132
bill_gensheet
1318 posts
No, but just tried to see how it would go:

https://www.onlineocr.net/pdftoexcel

Seemed quite good, except for times ending in ½, which came out as % or 1/2.
While fixing the % is easy, turning 11/221/2 back into 11/22 ½ is more complicated.
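
One possible way to tidy up both cases after the OCR step is a pair of regular expressions. This is only a sketch, and it rests on an assumption not confirmed above: that minutes are always printed as two digits, so a cell ending in two digits followed by "1/2" or "%" is really a half-minute time. The function name normalise_half_minute is made up for illustration.

```python
import re

# Assumption: minutes are always two digits, so "<2 digits>%" or
# "<2 digits>1/2" at the end of a cell is a misread half minute.
HALF_AS_PERCENT = re.compile(r"(\d{2})\s*%$")
HALF_AS_FRACTION = re.compile(r"(\d{2})\s*1/2$")


def normalise_half_minute(cell):
    """Rewrite a misread trailing half minute as ' ½'; leave other cells alone."""
    cell = cell.strip()
    cell = HALF_AS_PERCENT.sub(r"\1 ½", cell)
    cell = HALF_AS_FRACTION.sub(r"\1 ½", cell)
    return cell


for raw in ["11/221/2", "11/22%", "11/22"]:
    print(raw, "->", normalise_half_minute(raw))
```

Anchoring on two digits before the "1/2" is what separates 11/22 ½ from a misparse such as 11/2 followed by 21/2, but it would need checking against real output before trusting it on a whole timetable.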

However, that was a 2015 file, which looked like it had been printed to PDF rather than scanned.
