Summary: We learned how to scrape a website using Selenium in Python and collect large amounts of data. You can then run all kinds of unstructured-data analytics on it and find interesting trends, sentiments, and more. If anyone is interested in the complete code, here is the link to my GitHub. Let me know if this was helpful.
Selenium refers to a number of different open-source projects used for browser automation. It supports bindings for all major programming languages, including our favorite language: Python. The Selenium API uses the WebDriver protocol to control a web browser such as Chrome, Firefox, or Safari, and the browser can run either locally or remotely. Selenium is a great tool for web scraping, especially when learning the basics. But depending on your goals, it is sometimes easier to choose an already-built tool that does web scraping for you; building your own scraper is a long and resource-costly undertaking that might not be worth the time and effort.
- What this is for: Scraping web pages to collect review data and storing the data into a CSV
- Requirements: Python Anaconda distribution; basic knowledge of HTML structure and the Chrome Inspector tool
- Concepts covered: Selenium, Error exception handling
- Download the entire Python file
In an earlier blog post, I wrote a brief tutorial on web scraping with BeautifulSoup. This is a great tool but has some limitations, particularly if you need to scrape a page with content loaded via AJAX.
Enter Selenium. Its Python bindings are capable of scraping AJAX-generated content. Before we continue, it is important to note that Selenium is technically a testing tool, not a scraper.
That said, Selenium is simple to use and can get the job done. In this tutorial, we’ll write code similar to what you would need to scrape review data from a website and store it in a CSV file.
Install Selenium library
First, we’ll install the Selenium library in Anaconda.
Click on your Start menu and search for Anaconda Prompt. Open a new Anaconda Prompt window.
Change the directory to where you have Anaconda installed. For example:
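The exact commands depend on where Anaconda lives on your machine; a typical sequence (the path below is an assumption, not from the original post) looks like this:

```shell
# Move into your Anaconda install directory (adjust the path to yours)
cd C:\Users\yourname\Anaconda3

# Install Selenium into the active conda environment
conda install -c conda-forge selenium
```

conda will prompt `Proceed ([y]/n)?` before installing; running `pip install selenium` from the Anaconda Prompt also works.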
The install will take a moment to load and ask for consent. Once it finishes, open Anaconda Navigator, go to the Environments tab, and search the packages list to make sure Selenium installed.
We’ll also need to install Chromedriver for the code to work. This essentially lets the code take control of a Chrome browser window.
Chromedriver is available for download here. Extract the ZIP file and save the .EXE somewhere on your computer.
Getting started in Python
First we’ll import our libraries and establish our CSV and Pandas dataframe.
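A minimal sketch of this setup step — the column names and output filename are my own placeholders, not from the original post:

```python
import csv

import pandas as pd

# Columns for the review data we plan to collect -- hypothetical names
columns = ["name", "category", "reviews", "rating"]

# The dataframe will hold one row per scraped business
df = pd.DataFrame(columns=columns)

# Start the CSV file with a header row
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerow(columns)
```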
Next we’ll define the URLs we want to scrape as an array. We’ll also define the location of our web driver EXE file.
Because we’re scraping multiple pages, we’ll create a for loop to repeat our data gathering steps for each site.
Selenium can grab elements by their ID, class, tag, or other properties. To find the ID, class, tag, or other property you want to scrape, right-click within the Chrome browser and select Inspect (or press F12 to open the Inspector window).
In this case we’ll start by collecting the H1 data, which is simple with the find_element_by_tag_name method.
Next, we’ll collect the type of business. For this example, the site I was scraping needed this data cleaned a little bit because of how the data was stored. You may run into a similar situation, so let’s do some basic text cleaning.
When I looked at the section markup with the Chrome Inspector, it looked something like this:
In order to send clean data to the CSV, we’ll need to remove the “Categories:” text and replace line breaks with a pipe character to store data like this: “Type1 | Type2”. This is how we can accomplish that:
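A sketch of that cleaning step. The raw text with a “Categories:” label followed by newline-separated types mirrors the markup described above; the function name is mine:

```python
def clean_categories(raw_text):
    # raw_text arrives as a "Categories:" label plus newline-separated
    # types, e.g. "Categories:\nType1\nType2" (hypothetical markup)
    text = raw_text.replace("Categories:", "").strip()
    # Replace the remaining line breaks with a pipe separator
    return text.replace("\n", " | ")

clean_categories("Categories:\nType1\nType2")  # -> "Type1 | Type2"
```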
Scraping other elements
For the other elements, we’ll use Selenium’s other methods to capture by class.
Now, let’s piece all the data together and add it to our dataframe. Using the variables we created, we’ll populate a new row to the dataframe.
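A sketch of that step, with placeholder values standing in for what the scraping calls collected (column names and values are assumptions):

```python
import pandas as pd

df = pd.DataFrame(columns=["name", "category", "reviews", "rating"])

# Stand-ins for the values the scraping steps gathered
name, category, reviews, rating = "Joe's Diner", "Type1 | Type2", "12", "4.5"

# Append one row at the next free index, then write everything out
df.loc[len(df)] = [name, category, reviews, rating]
df.to_csv("reviews.csv", index=False)
```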
One error you may encounter is missing data. For example, if a business doesn’t have any reviews or comments, the site may not render the div that contains this info on the page.
If you attempt to scrape a div that doesn’t exist, you’ll get an error. But Python lets you handle errors with a try block.
So let’s assume our business may not have a star rating. In the try: block we’ll write the code for what to do if the “starrating” class exists. In the except: block, we’ll write code for what to do if the try: block returns an error.
A word of caution: if you are planning to do statistical analysis of the data, be careful how you replace error data in the except block. For example, if your code cannot find the number of stars, entering “0” will skew your analysis, because there is a difference between having a 0-star rating and not having a star rating at all. So in this example, data that returns an error will produce a “-” in the dataframe and CSV file instead of a 0.