phx000/linkedin-scraper

LinkedIn Sales API scraper

Screenshot of the dashboard

What's this?

A full-stack web-app + Chrome extension for automating data scraping through the LinkedIn Sales API. See the live demo at mainq.io. Here are some of the most useful features implemented in the backend:

  • Connect multiple LinkedIn Sales accounts. The backend will distribute concurrent API calls between them and automatically switch when a Sales account gets rate-limited.
  • Stores downloaded Account (company) and Lead (person) data in PostgreSQL, so it can easily be exported as JSON or CSV.
  • Data deduplication.
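
As a rough illustration of the account-switching feature above, here is a minimal sketch of a pool that hands out accounts round-robin and skips any account that was recently rate-limited. It is not the code from this repo: the names `SalesAccount`, `AccountPool`, `acquire`, `mark_rate_limited` and the 15-minute cooldown are all assumptions made for the example.

```python
import time
from dataclasses import dataclass


@dataclass
class SalesAccount:
    """Hypothetical stand-in for one connected LinkedIn Sales account."""
    name: str
    cooldown_until: float = 0.0  # clock value before which the account is skipped

    def available(self, now: float) -> bool:
        return now >= self.cooldown_until


class AccountPool:
    """Round-robin pool that skips accounts currently in cooldown."""

    def __init__(self, accounts, cooldown_seconds=900.0, clock=time.monotonic):
        self._accounts = list(accounts)
        if not self._accounts:
            raise ValueError("at least one account is required")
        self._cooldown = cooldown_seconds  # assumed value, not taken from the repo
        self._clock = clock
        self._next = 0  # index where the next search starts

    def acquire(self):
        """Return the next available account, or None if all are cooling down."""
        now = self._clock()
        count = len(self._accounts)
        for offset in range(count):
            index = (self._next + offset) % count
            account = self._accounts[index]
            if account.available(now):
                self._next = (index + 1) % count
                return account
        return None

    def mark_rate_limited(self, account):
        """Take an account out of rotation for the cooldown period."""
        account.cooldown_until = self._clock() + self._cooldown
```

A caller would `acquire()` an account before each API call and call `mark_rate_limited()` when LinkedIn rejects the request, so that subsequent calls automatically move to the remaining accounts.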

About

I built this with the end goal of monetizing the solution as a subscription-based SaaS. That's not my goal anymore, so I'm open-sourcing it. The LinkedIn Sales API isn't public and there are no docs for it. This project involved a lot of reverse engineering, and I want to make it available to anybody interested.

Note that most of the code in this repo is focused on the web-app component of the task (controlling things like per-user concurrency limits, automatically switching between Sales accounts, and other "product"-oriented features). The LinkedIn Sales API code is mostly in linkedin.py.

Architecture

Looking back at it, it's quite overengineered. It runs via docker-compose:

                         ┌───────────────┐
                         │     NGINX     │
                         └───────────────┘
                               │   │
                       ┌───────┘   └───────┐
                       ↓                   ↓
              ┌─────────────────┐ ┌─────────────────┐
              │ API (replica 1) │ │ API (replica 2) │
              └─────────────────┘ └─────────────────┘
                       │                   │
                       └─────────┬─────────┘
   ┌──────────────┐              │
   │   postgres   │              ↓
   └──────────────┘      ┌───────────────┐      ┌──────────────┐
                         │ Orchestrator  │      │    redis     │
   ┌──────────────┐      └───────────────┘      └──────────────┘
   │ disk volume  │              │
   └──────────────┘              ↓
                            ┌────────┐
                            │  NATS  │
                            └────────┘
                                 │
                  ┌──────────────┴──────────────┐
                  ↓                             ↓
       ┌────────────────────┐        ┌────────────────────┐
       │ Worker (replica 1) │        │ Worker (replica 2) │
       └────────────────────┘        └────────────────────┘

  • NGINX acts as the reverse proxy.
  • The multiple API replicas handle things like authentication but their main job is to create and schedule user-requested scraping jobs.
  • The orchestrator feeds scraping jobs to workers according to the per-user job-concurrency limits. This allows users to create jobs in bulk, while the backend feeds jobs to workers at a controlled rate. That keeps the Sales accounts from getting rate-limited by LinkedIn too quickly, and it also acts as a rate limiter that protects this backend's workers. Internally, the orchestrator stores one queue per user, which is consumed asynchronously as jobs complete.
  • The workers run scraping, export, deduplication, etc., to avoid overloading the main API servers. They are handled by the TaskIQ framework.
  • NATS is the event broker, used to asynchronously feed jobs into the workers.
  • All data is stored in Postgres. LinkedIn Sales responses are stored in JSONB columns.
  • A shared docker volume is used to share large static CSV files between the workers and the API servers.
  • Redis is used to store TaskIQ results, which can be awaited from the API servers.
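
To make the orchestrator's role more concrete, here is a minimal, synchronous sketch of per-user queues with a concurrency cap. It is not the repo's implementation: `Orchestrator`, `submit`, `job_finished` and the `dispatch` callback are hypothetical names, and `dispatch` merely stands in for whatever publishes a job to the workers (in this architecture, that happens through NATS and TaskIQ).

```python
from collections import defaultdict, deque


class Orchestrator:
    """Keeps one FIFO queue per user and caps in-flight jobs per user."""

    def __init__(self, dispatch, max_concurrent_per_user=2):
        self._dispatch = dispatch              # hypothetical callback: (user_id, job) -> None
        self._limit = max_concurrent_per_user  # assumed limit, not taken from the repo
        self._queues = defaultdict(deque)      # user_id -> jobs waiting to run
        self._running = defaultdict(int)       # user_id -> jobs currently in flight

    def submit(self, user_id, job):
        """Queue a job; it is dispatched immediately only if the user has capacity."""
        self._queues[user_id].append(job)
        self._drain(user_id)

    def job_finished(self, user_id):
        """Free one slot for the user and dispatch the next queued job, if any."""
        if self._running[user_id] > 0:
            self._running[user_id] -= 1
        self._drain(user_id)

    def pending(self, user_id):
        """Number of jobs still waiting in the user's queue."""
        return len(self._queues[user_id])

    def _drain(self, user_id):
        queue = self._queues[user_id]
        while queue and self._running[user_id] < self._limit:
            job = queue.popleft()
            self._running[user_id] += 1
            self._dispatch(user_id, job)
```

With a limit of 2, a user who submits five jobs at once gets two dispatched right away; the other three wait in that user's queue and are released one at a time as `job_finished` is called, without affecting other users' queues.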
