Manage blacklisted requests with Scrapy
Goal
A scraper is downloading pages from a website.
However, the website rate-limits by IP: after the scraper downloads 10 pages, the website returns only an empty page with an HTTP 429 status.
Must the scraper wait until the limit resets? No!
Instead, the scraper asks Scrapoxy to replace the blacklisted instance with a fresh one.
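The replacement decision itself is simple: if a response status is on the blacklist, the instance that served it should be discarded and the request retried through another one. A minimal sketch of that logic (the function names are illustrative, not Scrapoxy's actual implementation):

```python
# Statuses that mark an instance as blacklisted (matches the 429 above).
BLACKLIST_HTTP_STATUS_CODES = [429]

def is_blacklisted(status):
    """Return True when the response status means the IP is rate-limited."""
    return status in BLACKLIST_HTTP_STATUS_CODES

def handle_response(status, instance):
    """Decide what to do with a response served by a proxy instance.

    Returns ('keep', instance) for a good response, or
    ('replace', instance) when the instance must be stopped and the
    request retried through a fresh one.
    """
    if is_blacklisted(status):
        return ('replace', instance)
    return ('keep', instance)
```

A 200 keeps the instance; a 429 flags it for replacement.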
Step 1: Create a Scraper
See Integrate Scrapoxy to Scrapy to create the scraper.
Step 2: Edit the settings of the Scraper
Add this content to myscraper/settings.py:
CONCURRENT_REQUESTS_PER_DOMAIN = 1
RETRY_TIMES = 0

# PROXY
PROXY = 'http://127.0.0.1:8888/?noconnect'

# SCRAPOXY
API_SCRAPOXY = 'http://127.0.0.1:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

# BLACKLISTING
# Blacklist the rate-limit status the website returns (see Goal above)
BLACKLIST_HTTP_STATUS_CODES = [429]

DOWNLOADER_MIDDLEWARES = {
    'scrapoxy.downloadmiddlewares.proxy.ProxyMiddleware': 100,
    'scrapoxy.downloadmiddlewares.wait.WaitMiddleware': 101,
    'scrapoxy.downloadmiddlewares.scale.ScaleMiddleware': 102,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapoxy.downloadmiddlewares.blacklist.BlacklistDownloaderMiddleware': 950,
}
Warning
Don’t forget to change the password!
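You can check that the scraper and Scrapoxy agree on the password by calling the commander API directly. A sketch, assuming the commander accepts the base64-encoded password in an Authorization header (verify this against your Scrapoxy version's API documentation):

```python
import base64

API_SCRAPOXY = 'http://127.0.0.1:8889/api'
API_SCRAPOXY_PASSWORD = 'CHANGE_THIS_PASSWORD'

def auth_headers(password):
    """Build the Authorization header for the commander API.

    Assumption: the commander expects 'Basic <base64(password)>';
    check your Scrapoxy version before relying on it.
    """
    token = base64.b64encode(password.encode('utf-8')).decode('ascii')
    return {'Authorization': 'Basic ' + token}

# Example call (requires a running Scrapoxy commander):
# import urllib.request
# req = urllib.request.Request(API_SCRAPOXY + '/instances',
#                              headers=auth_headers(API_SCRAPOXY_PASSWORD))
# print(urllib.request.urlopen(req).read())
```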
Step 3: Edit the settings of Scrapoxy
Change the password of the commander in my-config.json:
"commander": {
"password": "CHANGE_THIS_PASSWORD"
},
Warning
Don’t forget to change the password!
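The fragment above must remain valid JSON inside my-config.json. After editing, the commander section looks like this (all other sections of the file are omitted here):

```json
{
  "commander": {
    "password": "CHANGE_THIS_PASSWORD"
  }
}
```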