How to use Brightdata with Scrapy
With weegee.ch I run a handful of crawlers. At the beginning I didn’t have problems with crawling protection, but over time my little project got more successful (I’ve been on the first page of Google for the relevant search queries for a few weeks now), so I see more and more websites blocking me.
I went for Brightdata as they offer a good “pay as you use” model which works well for smaller sites like mine (and doesn’t require paying a minimum price per month).
I couldn’t find documentation that covers the whole install process, which is why I’m writing this post.
What you’ll need before following this howto
- a Brightdata account
- I’m assuming you have a Linux host where you’re running your crawlers
Install luminati-proxy
To use Brightdata’s service you need a daemon which runs locally. Scrapy connects to the daemon on localhost, and the daemon then connects to Brightdata’s proxy. There are several “proxy types”:
- datacenter (cheapest option)
- ISP
- residential, mobile, … (for these you need to go through a compliance process to prove that your use case is not fraudulent)
There is an official installation guide for luminati-proxy (I chose the manual install; their automated bash script didn’t work for me and is too scary anyway, as it installs Node versions on its own, something I’d rather do myself).
Make sure you double-check the Node and npm versions. I advise installing on your laptop first to get everything running before installing on your server:
- double check that your Node version (`node --version`) and npm version (`npm --version`) match the requirements. If you need to upgrade/downgrade, it’s best to use `n` (install with `npm install -g n`). Note: the npm version somehow up/downgrades with the Node version. If it doesn’t, use `sudo npm install -g npm@8.1.3`
- you might need to start a new shell
- finally, install with `sudo npm install -g @luminati-io/luminati-proxy`
Make sure you get no error message (I ran into `error: no matching function for call to 'v8::FunctionTemplate::GetFunction()'`, which, although the proxy daemon started, left all requests hanging in a pending state).
Now, start the proxy with `proxy-manager`, which among other output should show this:
| ================================================ |
|                                                  |
|                                                  |
|               Open admin browser:                |
|             http://127.0.0.1:22999               |
|                 ver. 1.357.632                   |
|                                                  |
|    Do not close the process while using the      |
|                  Proxy Manager                   |
|                                                  |
|                                                  |
| ================================================ |
Now, open your browser at http://127.0.0.1:22999 (important: HTTPS does not work!) and log in with your Brightdata account.
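If you’re setting this up on a headless box and want a quick sanity check that the admin UI is actually listening before you point a browser at it, a minimal sketch with Python’s `requests` package would do (my assumption: a plain GET on the admin port returns 200):

```python
# Quick sanity check that the Proxy Manager admin UI answers on its
# default port 22999. A 200 only means the UI is up; you still log
# in through the browser as described above.
import requests

resp = requests.get("http://127.0.0.1:22999", timeout=5)
print(resp.status_code)  # expect 200
```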
Testing the luminati proxy
Before jumping into Scrapy, make sure that the proxy is really working: in the console, run `curl -x localhost:24000 http://lumtest.com/myip.json`, which will use luminati as the proxy. Port 24000 is the default port; you can configure more ports in the Proxy Manager web view via “add new port”.
If all is successful you should see…
- In the Proxy Manager web view, on the overview tab, the request pops up in the bottom half of the screen
- The result of curl should show an IP which is not your own
If both are true then congrats: stage 1 is clear, you have a scraping proxy running!
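If you prefer to run this test from Python instead of curl, a minimal sketch doing the same thing (assuming the `requests` package is installed) looks like this:

```python
# Same test as the curl command above: fetch lumtest.com/myip.json
# through the local Proxy Manager on its default port 24000.
import requests

resp = requests.get(
    "http://lumtest.com/myip.json",
    proxies={"http": "http://localhost:24000"},
    timeout=10,
)
print(resp.json())  # the "ip" field should not be your own IP
```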
Connect Scrapy to luminati proxy
This is the easy part: all you need is to `pip install scrapyx-bright-data` and then add this config to your crawler:
custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        'scrapyx_bright_data.BrightDataProxyMiddleware': 610,
    },
    'RETRY_TIMES': 20,
    'BRIGHTDATA_ENABLED': True,
    'BRIGHTDATA_URL': 'http://127.0.0.1:24000',
}
I like this per-crawler config as I’ll only pipe a few of my crawlers through the proxy. Crawlers that don’t need a proxy I leave without one, since I’m paying for every request going through Brightdata.
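For context, here is a minimal spider sketch showing where that per-crawler config lives (the spider name, start URL and parse logic are placeholders, not from my actual project):

```python
import scrapy

class MyCrawler(scrapy.Spider):
    # hypothetical example spider; the custom_settings block is the
    # only Brightdata-specific part
    name = "mycrawler"
    start_urls = ["http://example.com"]

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapyx_bright_data.BrightDataProxyMiddleware': 610,
        },
        'RETRY_TIMES': 20,
        'BRIGHTDATA_ENABLED': True,
        'BRIGHTDATA_URL': 'http://127.0.0.1:24000',
    }

    def parse(self, response):
        # your parsing logic goes here
        yield {"url": response.url}
```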
Now, when running the crawler with `scrapy runspider mycrawler.py -L WARN`, you should see the requests in your Proxy Manager window.
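As an aside: if you’d rather not add the scrapyx-bright-data dependency, Scrapy’s built-in HttpProxyMiddleware honors `request.meta['proxy']`, so a hand-rolled sketch of the same routing (spider name and URL are again placeholders) could look like this:

```python
import json

import scrapy

class ManualProxySpider(scrapy.Spider):
    # hypothetical spider that routes each request through the local
    # Proxy Manager via Scrapy's built-in HttpProxyMiddleware, which
    # reads request.meta['proxy']
    name = "manualproxy"

    def start_requests(self):
        yield scrapy.Request(
            "http://lumtest.com/myip.json",
            meta={"proxy": "http://127.0.0.1:24000"},
        )

    def parse(self, response):
        yield json.loads(response.text)
```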
Installing on server
The procedure on the server is similar to installing locally. The only difference is the way you access the Proxy Manager.
- make sure port 22999 is reachable from the outside (you could open it to the world, as Brightdata does IP protection for you)
- connect your browser to http://your.ip-address:22999 (e.g. http://123.123.123.123:22999). Important: HTTPS does not work, it’s HTTP only. And the domain name of my Linux box didn’t work either, as it would always redirect me to HTTPS
- after logging in you’ll be blocked and need to whitelist your IP on the Linux host by running e.g. `lpm_whitelist_ip 123.123.123.123`
That’s it! After this I was delighted to see my Scrapy crawler unblocked and fresh data flowing into my little project again 🎉