xpdf

I’m trying to get text extraction from PDFs working on AWS Lambda for a little fun project of mine.

There are many ways to extract text from PDFs with Python, but none of them really worked for me:

  • PyPDF2 just returned newlines for my test PDFs
  • tika (which calls Apache Tika) was too slow (it needs to start a Java server on localhost first)

Finally I ended up using xpdf’s pdftotext. Sadly I couldn’t install xpdf on AWS EC2 (Amazon Linux), so I had to compile it, which turned out to be quite straightforward:

sudo yum install -y cmake freetype-devel clang
cmake -DCMAKE_BUILD_TYPE=Release .
make

There will be warnings about Qt missing, but they are not relevant, as we’re only interested in the xpdf tools.

make produces xpdf/pdftotext, which has only one shared library dependency (/usr/lib64/libstdc++.so.6) that you need to take care of to make it work on AWS Lambda.

Copy pdftotext into the root of your Lambda package and /usr/lib64/libstdc++.so.6 into lib/libstdc++.so.6. Then you can call pdftotext like this:

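    import os
    import subprocess

    # The binary sits next to this script, the bundled library in lib/
    SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
    LIB_DIR = os.path.join(SCRIPT_DIR, 'lib')

    # '-enc UTF-8' sets the output encoding, the trailing '-' writes to stdout
    args = [os.path.join(SCRIPT_DIR, 'pdftotext'),
            '-enc', 'UTF-8',
            'my.pdf',
            '-']
    # Let the dynamic loader find the bundled libstdc++.so.6
    env = os.environ.copy()
    env['LD_LIBRARY_PATH'] = LIB_DIR
    res = subprocess.run(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         env=env)
    # A negative return code means the process was killed by a signal
    if res.returncode != 0:
        raise RuntimeError("pdftotext exited with {}:\n{}".format(
            res.returncode, res.stderr.decode('utf-8', errors='replace')))
    output = res.stdout.decode('utf-8')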
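
If you want to call this from more than one place, the snippet above could be wrapped roughly like this. This is only a sketch: the names build_call and pdf_to_text, the timeout parameter and the 30 second default are my own choices for illustration, and I assume the package root is whatever the LAMBDA_TASK_ROOT environment variable points to, falling back to the current directory outside of Lambda.

```python
import os
import subprocess

# Assumption: on Lambda the package is unpacked at LAMBDA_TASK_ROOT;
# anywhere else we fall back to the current working directory.
DEFAULT_ROOT = os.environ.get("LAMBDA_TASK_ROOT", os.getcwd())


def build_call(pdf_path, root=DEFAULT_ROOT):
    """Return (args, env) for running the bundled pdftotext on pdf_path."""
    args = [os.path.join(root, "pdftotext"), "-enc", "UTF-8", pdf_path, "-"]
    env = os.environ.copy()
    # Point the dynamic loader at the bundled libstdc++.so.6
    env["LD_LIBRARY_PATH"] = os.path.join(root, "lib")
    return args, env


def pdf_to_text(pdf_path, root=DEFAULT_ROOT, timeout=30):
    """Run pdftotext on pdf_path and return the extracted text as str."""
    args, env = build_call(pdf_path, root)
    res = subprocess.run(args,
                         stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE,
                         env=env,
                         timeout=timeout)  # hypothetical default, tune as needed
    if res.returncode != 0:
        raise RuntimeError("pdftotext exited with {}:\n{}".format(
            res.returncode, res.stderr.decode("utf-8", errors="replace")))
    return res.stdout.decode("utf-8")
```

Usage would then be something like text = pdf_to_text("/tmp/my.pdf"). Keeping the command construction in its own function makes it possible to check the arguments and environment without having the binary around.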

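In my case I was only interested in the words, not in whitespace or punctuation, so I collapsed every run of non-word characters into a single space with words = re.sub(r"\W+", " ", output) (this needs import re).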
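
To make the effect of that substitution concrete, here is a tiny self-contained sketch. The function name words_only and the sample string are made up for illustration, and the extra strip() (which drops a leading or trailing space) is my addition, not part of the one-liner above.

```python
import re


def words_only(text):
    """Collapse every run of non-word characters into a single space."""
    # \W matches anything that is not a letter, digit or underscore
    return re.sub(r"\W+", " ", text).strip()


print(words_only("Hello,\n\nworld!  PDF-text"))  # prints: Hello world PDF text
```

Note that digits and underscores count as word characters, so they survive the cleanup.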