Seraph2000/linkfinder

linkfinder

Scrapy spider to crawl all links on a website

Installation

Requirements

Create a virtual environment with virtualenv

virtualenv --python=<path to python> <name of virtual environment>

Activate the virtual environment, then install Scrapy inside it, together with any necessary dependencies

source <environment name>/bin/activate
pip install scrapy
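Before installing, it can help to confirm that the interpreter really is the one inside the virtual environment. The following is a minimal sketch using only the standard library; the function name `in_virtualenv` is made up for this illustration and is not part of this project.

```python
import sys


def in_virtualenv():
    """Return True if the running interpreter belongs to a virtual environment."""
    # Recent virtualenv and venv releases make sys.prefix differ from sys.base_prefix.
    base_prefix = getattr(sys, "base_prefix", sys.prefix)
    # Older virtualenv releases set sys.real_prefix instead.
    return sys.prefix != base_prefix or hasattr(sys, "real_prefix")


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```

If this prints `False` after activation, the `python` on your path is probably not the one from the environment.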

Running Instructions

cd into the root directory of the Scrapy project linkfinder, then run the following command

scrapy crawl links -o <file-name>.<format>

The output file can be .csv, .json, or .xml, for example, depending on the post-processing you plan to do.
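As an illustration of post-processing, the sketch below loads a .json export and returns the unique links in first-seen order. It assumes each exported item is an object with a `url` field; that field name is a guess, not taken from this project, so adjust `field` to match the spider's actual output. The helper names `unique_links` and `load_links` are also hypothetical.

```python
import json
from urllib.parse import urldefrag


def unique_links(items, field="url"):
    """Return links in first-seen order, fragments stripped, duplicates removed."""
    seen = set()
    result = []
    for item in items:
        link = item.get(field)  # "url" is an assumed field name
        if not link:
            continue  # skip items without a link
        link, _fragment = urldefrag(link)  # treat /page and /page#top as the same link
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result


def load_links(path, field="url"):
    """Read a Scrapy .json export (a JSON array of items) and de-duplicate its links."""
    with open(path, encoding="utf-8") as fh:
        return unique_links(json.load(fh), field)
```

For example, `load_links("links.json")` would return a plain list of strings, ready to be written out or compared against an earlier crawl.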

Troubleshooting and Optimisation

Common scraping issues

  • A website's owner may have made it dynamic to deter scraping; whatever the reason, it is useful to determine first whether a website is dynamic or not

  • Dynamic web content

    • solution 1: use the Python requests library if the website backend isn't too complex
      • study the requests happening behind the scenes, and check whether cookies are being used.
    • solution 2: use Selenium, if it will not slow the scraping down too much
    • solution 3: determine whether the website has alternative URLs (e.g. containing 'data', or even JSON-formatted pages)
  • Robot blocking software

    • software that runs in the background on websites to detect bots and protect the site's data
  • CAPTCHAs
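One rough way to check whether a website is dynamic is to look at the raw HTML returned by the server, before any JavaScript runs. The sketch below is a heuristic only, using the standard library: it flags a page as probably dynamic when the raw HTML loads scripts but contains very few links. The name `looks_dynamic` and the `min_links` threshold are assumptions made for this example and will need tuning per site.

```python
from html.parser import HTMLParser


class _TagCounter(HTMLParser):
    """Count <a href=...> links and <script> tags in raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = 0
        self.scripts = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" and value for name, value in attrs):
            self.links += 1
        elif tag == "script":
            self.scripts += 1


def looks_dynamic(html, min_links=5):
    """Heuristic: scripts present but fewer than min_links links in the raw HTML."""
    counter = _TagCounter()
    counter.feed(html)
    return counter.scripts > 0 and counter.links < min_links
```

To try it, pass in the body of a plain HTTP response (for instance from `urllib.request.urlopen` or the requests library). If the function returns `True` while the same page shows many links in a browser, the links are probably being added by JavaScript, and one of the solutions listed above is likely to be needed.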

Froggy Weirdness
