Scrapy spider to crawl all links on a website
- Python 3.7: https://www.python.org/downloads/
- Terminal: this example uses the Ubuntu 18 terminal.
- Virtualenv: https://pypi.org/project/virtualenv/
- Scrapy: https://scrapy.org/
Create and activate a virtual environment, then install Scrapy:
virtualenv --python=<path to python> <name of virtual environment>
source <name of virtual environment>/bin/activate
pip install scrapy
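Run the spider (named links) from inside the Scrapy project directory and write the scraped data to a file:
scrapy crawl links -o <file-name>.<format>
The output format could be .csv, .json, or .xml, for example, depending on the post-processing required; Scrapy infers the format from the file extension.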
The owner may have made their website dynamic to deter scraping; either way, it is useful to determine whether a website is dynamic, i.e. whether its content is rendered by JavaScript after the initial page load.
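One rough way to check is to take some text that is visible in the browser and see whether it appears in the raw HTML that a scraper receives. The sketch below is a heuristic only, using the standard library; the function names visible_text and looks_dynamic are made up for this example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect page text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def visible_text(html):
    """Return the text of an HTML document with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


def looks_dynamic(raw_html, expected_text):
    """True if text seen in the browser is missing from the raw HTML.

    A missing snippet suggests the content is filled in by JavaScript.
    """
    needle = " ".join(expected_text.split()).lower()
    return needle not in visible_text(raw_html).lower()
```

The raw HTML can be saved with: scrapy fetch --nolog <url> > page.html. Alternatively, scrapy view <url> opens the page in a browser as Scrapy sees it.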
Dynamic web content
- solution 1: utilise the Python requests library, if the website backend isn't too complex
  - study the requests happening behind the scenes (e.g. in the network tab of the browser's developer tools), and whether cookies are being used
- solution 2: utilise Selenium, if this won't slow the scraping down too much
- solution 3: determine whether the website has alternative URLs that serve the same content (e.g. URLs containing 'data', or even JSON-formatted pages)
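Solutions 1 and 3 can be combined: once the background request has been identified, call that URL directly. A minimal sketch, assuming the site exposes a JSON endpoint and sets its cookies on a landing page; the function names and any URLs passed in are hypothetical.

```python
import requests


def fetch_json(session, url, params=None, timeout=10.0):
    """Request a JSON endpoint directly and return the decoded payload.

    `session` is expected to be a requests.Session, which stores cookies
    set by earlier responses and sends them back on later requests.
    """
    response = session.get(
        url,
        params=params,
        headers={"Accept": "application/json"},
        timeout=timeout,
    )
    response.raise_for_status()  # fail on 4xx/5xx rather than parse an error page
    return response.json()


def scrape_endpoint(landing_url, data_url, params=None):
    """Visit the landing page first to collect cookies, then fetch the data."""
    with requests.Session() as session:
        session.get(landing_url, timeout=10.0).raise_for_status()
        return fetch_json(session, data_url, params=params)
```

For example, scrape_endpoint("https://example.com/", "https://example.com/data", params={"page": 1}) would return the decoded JSON, if such an endpoint existed.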
Robot blocking software
- software that runs in the background on websites and is designed to detect bots, in order to protect the site's data
CAPTCHAs
- if the CAPTCHA isn't one of the latest types, it may be possible to plug in a library such as python-anticaptcha: https://pypi.org/project/python-anticaptcha/
