Part III of web scraping is all about scraping tables. As I mentioned earlier, I love data: checking statistics, analysing data, finding correlations, discovering significant differences, running regression analyses, and combining unexpected variables to form new perspectives on a topic and gain interesting insights. Data always tells you a story, and I simply love that. But again, you need to have the data before you can analyse it and gain new insights.
Tables on websites already contain data, often in a structured way, which makes them a very useful extra source on top of your own data. It’s almost ready-to-go data (if you know how to scrape it) that can be merged with other insightful data to create a possible synergy.
Worldometer is a website with plenty of insightful population-related information that you can use alongside your own data to compare, analyse, or create a valuable synergy.
Worldometer is just an example; there are obviously plenty of other websites that contain useful data that might add value to your specific data.
If you can’t download the data as a csv, you should always check whether you are allowed to scrape it. To find more information about that, check my next article about tips & tricks for web scraping.
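One signal you can check programmatically is a site's robots.txt file. The sketch below uses Python's standard urllib.robotparser on a made-up robots.txt and made-up example.com URLs, so it runs offline; the helper name is_allowed is hypothetical. It is only a first check under those assumptions and does not replace reading the website's terms of use.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "my-scraper") -> bool:
    """Return True if the given robots.txt rules permit fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up URLs, for illustration only.
print(is_allowed(ROBOTS_TXT, "https://example.com/data/table.html"))
print(is_allowed(ROBOTS_TXT, "https://example.com/private/table.html"))
```

For a real site you would load the site's own robots.txt instead of the inline string.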
Just as in the previous parts, I’ll start by importing the libraries I need:
import requests
from bs4 import BeautifulSoup
import pandas as pd
After that, I set the URL I want to scrape, request the page, and parse it:
url = 'https://www.worldometers.info/coronavirus/'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'lxml')
I decided to print(soup) to check whether I scraped the data I wanted. The output is also useful for finding the right ‘table’, ‘tr’ or ‘th’ that you need. Some people might find it useful to have that output in Spyder to see what to scrape. From my experience so far, I find it easier to scroll through the browser’s developer console on the website itself and copy and paste whatever I need.
The next step is to pull the specific data you need from the page and give it a name. In this case I need the main table, which has the id ‘main_table_countries_today’.
table_data = soup.find('table', id = 'main_table_countries_today')
Again, check with print(table_data) to see whether that is what you want. If so, continue.
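Besides printing, it can help to check that the table was found at all, because find() returns None when nothing matches. A minimal sketch, using made-up HTML in place of the real page and the built-in 'html.parser' so it runs without a network connection (the ids other_table and no_such_table are invented for illustration):

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the real page, so the example runs offline.
html = """
<table id="other_table"><tr><th>A</th></tr></table>
<table id="main_table_countries_today">
  <tr><th>Country</th><th>Cases</th></tr>
  <tr><td>Exampleland</td><td>10</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

table_data = soup.find("table", id="main_table_countries_today")
missing = soup.find("table", id="no_such_table")

# find() returns None when nothing matches, so check before going further.
if table_data is None:
    raise ValueError("table not found, check the id in the page source")

print(len(table_data.find_all("tr")))  # number of rows in the matched table
print(missing)
```

Failing early like this points you at the wrong id straight away, instead of at a confusing error further down the script.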
One thing that I learned from this web scraping self-learning course is to test as I go. I used to write the whole script and run it, only to find out I had done something wrong at the beginning and had to do it all over again. Unfortunately I then wrote the whole script AGAIN, only to find out I had done something wrong a bit further down the line AGAIN, which meant I had to do it all over, AGAIN. It definitely kept me busy, but it’s not the most efficient way. You do learn a lot from it, although I wouldn’t recommend it to everyone.
Pulling the data
When you have the data that you want, pull all the headers from the table by looping over the ‘th’ cells and collecting them in a list. That list then becomes the columns of your dataframe.
headers = []
for i in table_data.find_all('th'):
    title = i.text
    headers.append(title)
df = pd.DataFrame(columns = headers)
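The same header loop can also be written as a list comprehension. This is just an alternative sketch, not a required change: it uses a small made-up table (id demo) so it runs offline, and get_text(strip=True) to drop surrounding whitespace from each header.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table with some stray whitespace in the header cells.
html = """
<table id="demo">
  <tr><th> Country </th><th>  Total Cases</th></tr>
</table>
"""

table_data = BeautifulSoup(html, "html.parser").find("table", id="demo")

# Collect the text of every header cell, trimmed of leading/trailing whitespace.
headers = [th.get_text(strip=True) for th in table_data.find_all("th")]

# An empty dataframe whose columns are the scraped headers.
df = pd.DataFrame(columns=headers)
print(list(df.columns))
```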
The next step is to pull the rest of the data from the table, via a loop, and add each row to your dataframe (called df in this case). The [1:] skips the first row, which holds the headers.
for j in table_data.find_all('tr')[1:]:
    row_data = j.find_all('td')
    row = [td.text for td in row_data]
    length = len(df)
    df.loc[length] = row
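For larger tables, one possible variation is to collect the rows in a plain list first and build the dataframe once at the end, rather than growing it row by row. The sketch below assumes a small made-up table (id demo) and skips any row whose number of cells does not match the number of headers; that skip rule is my own assumption, so adjust it to your table.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Made-up table standing in for the scraped one.
html = """
<table id="demo">
  <tr><th>Country</th><th>Cases</th></tr>
  <tr><td>Exampleland</td><td>10</td></tr>
  <tr><td>Samplestan</td><td>25</td></tr>
</table>
"""

table_data = BeautifulSoup(html, "html.parser").find("table", id="demo")
headers = [th.text for th in table_data.find_all("th")]

rows = []
for tr in table_data.find_all("tr")[1:]:  # [1:] skips the header row
    cells = [td.text for td in tr.find_all("td")]
    # Keep only rows that line up with the headers.
    if len(cells) == len(headers):
        rows.append(cells)

# Build the dataframe in one go from the collected rows.
df = pd.DataFrame(rows, columns=headers)
print(df.shape)
```

Note that every value is still a string at this point; numbers need converting before you can calculate with them.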
Once you’ve done this, you can again check with print(df) to make sure all the data comes out the way you want. Depending on your table, characters such as ‘\n’ (newline) or ‘\t’ (tab) can end up in your values, for example because of whitespace inside a cell. If that’s the case, it will show up in the output of your print command. You can delete those extra characters by using strip() or replace() in your script. You can find this and more in my next article.
The last step, for now, is to write the dataframe to a csv file, but of course you can do anything you like with it.
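As a rough illustration of that clean-up, here is a small sketch using the pandas string methods str.strip() and str.replace(). The sample values are made up, and removing the thousands separator before converting to a number is my own assumption about what a table like this might need.

```python
import pandas as pd

# Made-up values with the kind of stray whitespace scraping can leave behind.
df = pd.DataFrame({
    "Country": ["\nExampleland\n", "Samplestan\t"],
    "Cases": [" 1,234", "25\n"],
})

# Trim leading/trailing whitespace (spaces, newlines, tabs) in every column.
for col in df.columns:
    df[col] = df[col].str.strip()

# Remove thousands separators, then convert the text to integers.
df["Cases"] = df["Cases"].str.replace(",", "", regex=False).astype(int)

print(df)
```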
df.to_csv('/Users/xxx/xxx/covid19.csv', index=False, encoding='utf-8')
I hope this was useful for you. If there are any questions, don’t hesitate to ask me.