How to Turn Web Scraping Into a Computer Vision Problem

Let’s face it. Web scraping can be a bit boring sometimes.

The general process is:

Find a site that has the data
Inspect an element and see where the data lives
If it is in an HTML table, write logic to parse the table
Run it for one page
Run it on all pages
Fix minor bugs
Done

Every site has its little quirks. Some people misuse <table> tags and put the column names in a normal <tr><td> row instead of using the <thead> tag. Other people are still living pre-2000, with everything on their site part of one big table. And sometimes the website updates, so you have to make yet more minor fixes.
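To make the usual DOM-parsing routine concrete, here is a minimal sketch of the "write logic to parse table" step using only Python's standard library. The names TableParser and parse_table are made up for this example, and the fallback rule (treat the first row as the header when there is no <thead>) is an assumption about how you might handle the quirk described above, not the only way to do it.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <tr> in the document as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []          # every row, in document order
        self.header_rows = 0    # how many rows sat inside a <thead>
        self._in_thead = False
        self._row = None        # cells of the <tr> currently being read
        self._cell = None       # text fragments of the current <td>/<th>

    def handle_starttag(self, tag, attrs):
        if tag == "thead":
            self._in_thead = True
        elif tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "thead":
            self._in_thead = False
        elif tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            if self._in_thead:
                self.header_rows += 1
            self._row = None
            self._cell = None


def parse_table(markup):
    """Return (header, body_rows) for the table rows found in `markup`."""
    parser = TableParser()
    parser.feed(markup)
    rows = parser.rows
    if not rows:
        return [], []
    # Quirk handling (an assumption): with no <thead>, the first <tr>
    # is taken to hold the column names.
    split = parser.header_rows or 1
    return rows[0], rows[split:]
```

Every new quirk means another branch like the one in parse_table, which is exactly the tedium the image-based approach tries to sidestep.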

In this tutorial, I’m going to show how we can turn web scraping into an image-based problem instead of a pure DOM-parsing exercise.

Setup

We’re going to be using Selenium to control our web browser and take snapshots.

You’ll need:

- Python 3 (I recommend the conda distribution), with these packages:
  - opencv
  - selenium
  - PIL (installed as the Pillow package)
- Chrome
- ChromeDriver (it has to match your version of Chrome, so it is best to keep a separate Chrome install that does not auto-update)
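With those pieces installed, the core snapshot step might look roughly like the sketch below. It assumes Selenium 4 and a recent Chrome that accepts the --headless=new flag. The function names, the window size, and the choice to box <table> elements are all illustrative, not taken from the notebook.

```python
import io


def rect_to_box(rect, scale=1.0):
    """Convert a Selenium-style rect dict to a (left, top, right, bottom) pixel box."""
    left = round(rect["x"] * scale)
    top = round(rect["y"] * scale)
    right = round((rect["x"] + rect["width"]) * scale)
    bottom = round((rect["y"] + rect["height"]) * scale)
    return (left, top, right, bottom)


def snapshot_tables(url, width=1280, height=2000):
    """Screenshot `url` and return (image, boxes), one box per <table> element."""
    # Imported here so rect_to_box can be used without a browser installed.
    from PIL import Image
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By

    options = Options()
    options.add_argument("--headless=new")  # assumes a recent Chrome
    options.add_argument(f"--window-size={width},{height}")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Screenshots are in device pixels, element rects in CSS pixels.
        scale = driver.execute_script("return window.devicePixelRatio") or 1
        # Only the visible viewport is captured, hence the tall window.
        image = Image.open(io.BytesIO(driver.get_screenshot_as_png()))
        boxes = [rect_to_box(element.rect, scale)
                 for element in driver.find_elements(By.TAG_NAME, "table")]
    finally:
        driver.quit()
    return image, boxes
```

Each box can then be passed to image.crop(box) to cut the element out of the screenshot, which gives you an image paired with a label that came straight from the DOM.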

Go to the Jupyter notebook.

You can also generate videos by using the ffmpeg library and opencv.
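As a rough sketch of the video step, the snippet below assumes you have already saved numbered PNG snapshots (for example frame_0001.png, frame_0002.png, ...) and that the ffmpeg command-line tool is on your PATH. It shells out to ffmpeg rather than using opencv; cv2.VideoWriter would be an alternative. The function names and file pattern are made up for illustration.

```python
import subprocess


def ffmpeg_command(frame_pattern, output_path, fps=10):
    """Build an ffmpeg call that joins numbered PNG frames into an H.264 video."""
    return [
        "ffmpeg", "-y",              # overwrite the output file if it exists
        "-framerate", str(fps),      # input frame rate
        "-i", frame_pattern,         # e.g. "frames/frame_%04d.png"
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",       # widely playable; needs even frame dimensions
        output_path,
    ]


def frames_to_video(frame_pattern, output_path, fps=10):
    """Run ffmpeg and raise CalledProcessError if it fails."""
    subprocess.run(ffmpeg_command(frame_pattern, output_path, fps), check=True)
```

Calling frames_to_video("frames/frame_%04d.png", "scrape.mp4") would then stitch the snapshots together in filename order.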

How You Can Use This

Say you’re designing a computer vision system to extract information from the real world with the goal of replicating it digitally. Now you can generate training data and output data using HTML!
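One way to picture the training-data idea: because you write the HTML yourself, you already know the ground truth for whatever image the browser renders from it. The sketch below generates a small random table together with its label and wraps it in a data: URL that a Selenium-driven Chrome could load and screenshot. The column names, value ranges, and function names are invented for the example.

```python
import base64
import html
import random


def make_table_html(header, rows):
    """Render a header and rows of strings as a <table> with escaped cell text."""
    head = "".join(f"<th>{html.escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{html.escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"


def random_sample(n_rows=3, seed=None):
    """Return (markup, label): the label is the data the image should decode to."""
    rng = random.Random(seed)
    header = ["item", "qty"]  # hypothetical columns
    rows = [[f"item-{rng.randint(0, 999)}", str(rng.randint(1, 50))]
            for _ in range(n_rows)]
    return make_table_html(header, rows), {"header": header, "rows": rows}


def to_data_url(markup):
    """Encode markup as a data: URL so a browser can open it without a server."""
    encoded = base64.b64encode(markup.encode("utf-8")).decode("ascii")
    return "data:text/html;charset=utf-8;base64," + encoded
```

Passing to_data_url(markup) to driver.get and taking a screenshot would give you the input image, while the label dict is the expected output.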

Or you want to build a web scraper to scrape sites that obscure the DOM and update frequently.

Or you want to build a more intelligent web scraper.
