How to extract text from pdf in python


I’m trying to get text extraction from pdfs working on lambda for a little fun project of mine.

Now, there are plenty of ways to extract text from pdfs using python, but nothing really worked for me:

  • pypdf2 just returned newlines for my test pdfs
  • tika (which calls apache tika) was too slow (needs to start a java server first on localhost)

Finally I ended up using xpdf’s pdftotext. Sadly I couldn’t install xpdf on AWS EC2 (Amazon Linux), so I needed to compile it, but it turned out to be quite straightforward:
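Once a pdftotext binary is available, calling it from python could look roughly like the sketch below. It assumes the binary is on the PATH (or that you pass its full path) and relies on pdftotext sending the text to stdout when the output file is given as "-". The function names are made up for illustration.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, binary="pdftotext"):
    """Build the argument list; '-' as output file sends the text to stdout."""
    return [binary, "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path, binary="pdftotext"):
    """Run pdftotext and return the extracted text as a string."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found")
    result = subprocess.run(
        pdftotext_command(pdf_path, binary),
        check=True,           # raise if pdftotext exits with an error
        capture_output=True,  # collect stdout instead of printing it
    )
    return result.stdout.decode("utf-8", errors="replace")
```

On lambda you would most likely bundle the compiled binary with the function and pass its path as `binary`.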

How to remove browser shortcuts interfering with cloud9


I just started playing around with cloud9, particularly because it looks like the ideal IDE to develop lambda functions.

One thing which bothered me from the start: Emacs keybindings such as ctrl-n for next line won’t work because this makes the browser (in my case firefox) open a new window. Similarly, ctrl-tab would cycle through the browser tabs instead of cycling through cloud9 tabs. I’d like to have all shortcuts available for cloud9, so ideally cloud9 would run in some minimalistic browser window.

Because neither Firefox nor Chrome supports removing shortcuts, I found this nice solution:

Scan with raspberry pi, convert with aws lambda to searchable PDF

What we're building today :)

I have long dreamed of a setup which lets me just press the scan button on my scanner and, without any further input, uploads the scan as a searchable PDF onto some cloud drive. Thanks to the good support of scanners by SANE and the ease of use of AWS lambda it’s actually quite easy (judging by the length of this post it looks like quite a task, but in the end it is straightforward and, surprisingly, quite free of hacks).

In this solution you:

  • set up SANE on your raspberry pi 3 so it scans your document
  • set up scanbd to detect the scan button
  • set up an S3 bucket for uploading
  • set up a lambda function which uses tesseract to create a searchable PDF
  • (optionally) set up google api to store the PDF to google drive
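To give an idea of the lambda part, here is a minimal sketch of a handler that reacts to an S3 upload and runs tesseract on it. It is not the function from the full post. It assumes that a tesseract binary is bundled with the function, that boto3 is available (as it is in the lambda python runtime), and that the trigger only fires for uploads outside the hypothetical `pdf/` prefix so the result does not trigger the function again. All function names and paths are made up for illustration.

```python
import posixpath
import subprocess
import urllib.parse


def parse_s3_event(event):
    """Return (bucket, key) of the first uploaded object in an S3 event."""
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # Object keys arrive URL-encoded in S3 event notifications.
    key = urllib.parse.unquote_plus(record["object"]["key"])
    return bucket, key


def tesseract_command(image_path, output_base):
    """'pdf' selects tesseract's PDF output; it writes <output_base>.pdf."""
    return ["tesseract", image_path, output_base, "pdf"]


def output_key(key):
    """Hypothetical naming scheme: results go under the 'pdf/' prefix."""
    name = posixpath.splitext(posixpath.basename(key))[0]
    return "pdf/" + name + ".pdf"


def handler(event, context):
    import boto3  # imported lazily so the helpers above work without boto3

    bucket, key = parse_s3_event(event)
    local_image = "/tmp/scan-input"   # /tmp is the writable directory on lambda
    output_base = "/tmp/scan-output"

    s3 = boto3.client("s3")
    s3.download_file(bucket, key, local_image)
    subprocess.run(tesseract_command(local_image, output_base), check=True)
    s3.upload_file(output_base + ".pdf", bucket, output_key(key))
```

Keeping the event parsing and the command building in small functions makes them easy to try out locally without an AWS account.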

What you need:

  • Raspberry Pi 3 (I guess the other models serve equally well)
  • Paper scanner with a “scan” button which is supported by saned
  • an AWS account

Personally I’m using Raspbian Stretch Lite as OS on my Raspberry and a Fujitsu S1300i.

Before you start: you might just want to wipe your pi and start fresh. That takes about 15 minutes extra, and you can follow my howto to do it headless (without attaching monitor/keyboard to the pi).

How to check for broken links in markdown files

dead link

Having blog articles up for more than 10 years calls for some kind of tool to check for dead links.

Having googled a bit I didn’t find anything convincing. So I just created a very dirty solution which did the job for me.

You start it with

python3 path/to/md/files/

and it iterates over all .md files in path/to/md/files, looks for links and images in your articles, sends an HTTP HEAD request to each and prints everything which does not look right.
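The idea can be sketched in a few lines of standard-library python. This is not the original script, just a minimal illustration: the regular expression only catches inline `[text](url)` and `![alt](url)` links with http(s) targets, and "does not look right" is assumed to mean any status other than 200.

```python
import pathlib
import re
import sys
import urllib.error
import urllib.request

# Matches [text](url) and ![alt](url); only http(s) targets are captured.
LINK_RE = re.compile(r"!?\[[^\]]*\]\((https?://[^)\s]+)")


def find_links(markdown_text):
    """Return all http(s) link and image targets found in the text."""
    return LINK_RE.findall(markdown_text)


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or the error as a string."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except (urllib.error.URLError, OSError) as err:
        return str(err)


def check_directory(root):
    """Print every link below root whose HEAD request is not a plain 200."""
    for md_file in sorted(pathlib.Path(root).rglob("*.md")):
        for url in find_links(md_file.read_text(encoding="utf-8")):
            status = head_status(url)
            if status != 200:
                print(f"{md_file}: {url} -> {status}")


if __name__ == "__main__" and len(sys.argv) > 1:
    check_directory(sys.argv[1])
```

Some servers answer HEAD requests differently from GET requests, so anything this prints is a candidate to check by hand rather than a confirmed dead link.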

How to set up raspberry pi headless with ssh and wifi

raspberry pi 3

Setting up raspberry pi is a bit tedious when doing it over attached monitor, keyboard and mouse (I usually don’t have those around anyway, being laptop only at the moment), so here’s a good and easy way to get an installation directly from your laptop, making the pi automatically join your wifi and enable ssh:



I found that etcher is a very easy way to flash the image. To do so:

  • Install etcher (available for linux/osx/windows)
  • Download image: I chose the raspbian lite version from official
  • Open etcher (on linux just unzip the download and open the executable therein)
  • Insert sd card (don’t mount it yet!), watch that etcher now detects that new card in the middle
  • Select image e.g. ~/Downloads/ and flash (for linux i3 users: you’ll get a polkit error. You’ll need to start a polkit agent, e.g. /usr/lib/policykit-1-gnome/polkit-gnome-authentication-agent-1 before flashing)
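For the ssh and wifi part, Raspbian looks for two files on the boot partition at first boot: an empty file called `ssh` and a `wpa_supplicant.conf`. The sketch below writes both. It assumes the freshly flashed boot partition is mounted at a path you pass in; the function names, the example mount point and the default country code `CH` are placeholders.

```python
import pathlib

WPA_TEMPLATE = """\
ctrl_interface=DIR=/var/run/wpa_supplicant GROUP=netdev
update_config=1
country={country}

network={{
    ssid="{ssid}"
    psk="{psk}"
}}
"""


def boot_files(ssid, psk, country="CH"):
    """Return a mapping of file name to content for the boot partition."""
    return {
        # An empty file named "ssh" enables the SSH server on first boot.
        "ssh": "",
        # Raspbian picks this file up on boot and uses it for the wifi setup.
        "wpa_supplicant.conf": WPA_TEMPLATE.format(
            country=country, ssid=ssid, psk=psk
        ),
    }


def write_boot_files(boot_dir, files):
    """Write the files into the mounted boot partition."""
    for name, content in files.items():
        (pathlib.Path(boot_dir) / name).write_text(content)
```

Usage would be something like `write_boot_files("/media/you/boot", boot_files("mywifi", "mypassword"))`, with the mount point depending on your system.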

How to streamline cd ripping without tagging track data

CD tower to rip

Since we recently stopped using Spotify (mainly because I think having everything at your fingertips influences the brain in a negative way) we switched to borrowing CDs from the local library (which, in our case, is only 200m away from our house).

Now, because the kids get CDs at least once a week, I needed a way to quickly import those CDs into our Sonos system without too much hassle. Since the kids only borrow children stories (spoken audio) which often are not on MusicBrainz, I needed an easy way to tag them myself. Because I don’t care about tagging every single track (because you usually listen to a story start to end anyway), I wanted to have a streamlined process. The following script does:

  • Rips the CD and converts it to m4a (AAC encoding, slightly better compression than mp3)
  • Ejects the CD
  • Asks me for the album and artist name
  • Opens chrome so I can choose an artwork
  • Converts the artwork to JPG in a reasonable size
  • Copies the music to the directory on my NAS
  • Triggers Sonos to update the music library

How to mass convert mp3 files to aac (m4a)

Since aac has a slightly better compression rate than mp3 (and, geez, mp3 was standardized in 1992, there must be better standards nowadays), I decided to mass convert my music library from mp3 to aac.

Won’t the quality be just awful?

Of course, re-encoding sounds like a terrible idea. You’re converting from one lossy format to another, similar to mass-converting gifs to jpegs. But on the other hand, for my setting it was just good enough. We listen to the converted library at home over Sonos or in the car, and in both settings there are only half-decent speakers. Also, many of the tracks I had converted from audio cassettes, so they were in bad quality already. You can certainly play with the bitrate, but if you have invested in an expensive stereo you’d be better off converting from a lossless source.


First things first: Almost everything in life is easier if you first reduce it to the absolute necessity. I recently spoke with a colleague who told me she has converted her whole CD stack into mp3 without first trashing the CDs she never listens to. That’s insane.

First, reduce your collection to, say, the albums you listened to in the past 12 months. Make it 24. But anything beyond that is just an unnecessary burden you don’t need to carry.

No words! I just want to copy-paste

Here you go: Once you have cd’ed into the directory with the mp3 files you want to convert, do this:

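# clean up the file names (removes spaces and other awkward characters)
detox *.mp3
# zsh only: the ([1]) glob qualifier picks the first matching file
ffmpeg -i *.mp3([1]) artwork.jpg
for i in *.mp3; do ffmpeg -i "$i" -c:a libfdk_aac -b:a 128k -vf scale=1280:-2 "${i%.mp3}.m4a"; done
for i in *.m4a; do AtomicParsley "$i" --artwork artwork.jpg --overWrite; done
# careful: this deletes the original mp3 files
rm artwork.jpg && rm *.mp3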

Feature phone with tethering (Nokia Asha 302) - tune out of Internet

Since the days of the iPhone, finding the right way to handle the mobile phone has been challenging for me. Working in web programming and being a father and husband at the same time means that I need to be able to connect to my colleagues at work when not at my desk while being able to “tune out” while being with my family.

My latest try at the problem was to buy a “feature phone” which was good enough to support the bare minimum (apart from phone calls and SMS this is WhatsApp and internet tethering for my tablet/laptop) but was dumb enough that it would not tempt me to check any news/mails while being with my family. I came across the “Nokia Asha 302” which is no longer produced but IMO sold badly enough that there are still some in stock in most countries (at least that was the case here in Switzerland).

I’ve been on this “feature phone” for a few days now and must say that I’m quite happy with it. When I am at work I just take my android tablet (nvidia shield k1) with me; if I want to tune out I can leave the house with my mobile only. As it is an outdated phone there are some tweaks you need to do, which I documented below:

How to migrate your wordpress to tumblr. Including images and comments.

So I’ve decided to move my wordpress blogs to tumblr. Although apparently TechCrunch thinks that’s a bad idea. And although Moritz Adler would kill me for that. (Although: He doesn’t have a personal blog and hence has no licence to kill me). Anyway. With tumblr I don’t need to host a blog software myself. And I don’t end up having my blog hacked and then seeing my blog being displayed as a malware site in Chrome/Firefox (happened to me twice). And then with tumblr I create new blogs with subdomains within minutes. Cool stuff. Hail to the cloud, baby!

So here you go: A complete guide on how to fully migrate your wordpress blog to tumblr. Including comments and pictures. And still supporting your old url scheme.

Update: I ran into a tool that claims to do a lot for you: It doesn’t migrate images and 302 redirects. Not sure about comments migration. And it costs 24$. Still, if you can leave out some of the steps below that’d be worth the money. Comments of the author on quora