Function to web scrape tables from several pages - Python

I am learning Python and I am trying to create a function to web scrape tables of vaccination rates from several different web pages in a GitHub repository for Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data; see also https://ourworldindata.org/about). The code works perfectly when web scraping a single table and saving it into a data frame:
import requests
from bs4 import BeautifulSoup
import pandas as pd

url = "https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Bangladesh.csv"
response = requests.get(url)

# parse the page and pull out the table GitHub uses to render the CSV
scraping_html_table_BD = BeautifulSoup(response.content, "lxml")
scraping_html_table_BD = scraping_html_table_BD.find_all("table", "js-csv-data csv-data js-file-line-container")

# read_html returns a list of DataFrames; the first one is the table we want
df = pd.read_html(str(scraping_html_table_BD))
BD_df = df[0]
But I have not had much luck when trying to create a function to scrape several pages. I have been following a tutorial's section 'Scrape multiple pages with one script' and several similar StackOverflow questions, amongst other pages. I tried creating a global variable first, but I end up with errors like "RecursionError: maximum recursion depth exceeded while calling a Python object". The code below is the best I have managed: it doesn't generate an error, but I've not managed to save the output to a global variable. I really appreciate your help.
import pandas as pd
from bs4 import BeautifulSoup
import requests

link_list = ['/Bangladesh.csv',
             '/Nepal.csv',
             '/Mongolia.csv']

def get_info(page_url):
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    result = {}
    df = pd.read_html(str(vaccination_rates))
    vaccination_rates = df[0]
    df = pd.DataFrame(vaccination_rates)
    print(df)
    df.to_csv("testdata.csv", index=False)

for link in link_list:
    get_info(link)
Edit: I can view the data from the final webpage in the iteration, as it saves to a csv file, but not the data from the preceding links.
new = pd.read_csv('testdata6.csv')
pd.set_option("display.max_rows", None, "display.max_columns", None)
new

This is because in every iteration your 'testdata.csv' is overwritten with a new one. So you can do:
df.to_csv(page_url[1:], index=False)
Slicing off the leading slash means each page is saved under its own name, e.g. 'Bangladesh.csv'.

I'm guessing you're overwriting your 'testdata.csv' each time, hence why you can only see the final page. I would either add an enumerate call to give each scraped page its own identifier, writing a separate CSV each time, e.g.:
for key, link in enumerate(link_list):
    get_info(link, key)
...
    df.to_csv(f"testdata{key}.csv", index=False)
Or open the CSV in append mode as part of your get_info function, steps of which are available in "append new row to old csv file python". A sketch combining both suggestions follows.
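Putting that together, here is a minimal sketch of the fixed function (the extra key parameter and the final concat are assumptions about how you might want to combine the results):
import pandas as pd
from bs4 import BeautifulSoup
import requests

link_list = ['/Bangladesh.csv', '/Nepal.csv', '/Mongolia.csv']

def get_info(page_url, key):
    base = 'https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data'
    page = requests.get(base + page_url)
    scrape = BeautifulSoup(page.text, 'html.parser')
    tables = scrape.find_all("table", "js-csv-data csv-data js-file-line-container")
    df = pd.read_html(str(tables))[0]
    # one file per page, so earlier results are no longer overwritten
    df.to_csv(f"testdata{key}.csv", index=False)
    return df

# collect the per-country frames instead of relying on a global variable
frames = [get_info(link, key) for key, link in enumerate(link_list)]
all_rates = pd.concat(frames, ignore_index=True)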

Related

How do I get the data from a table of a Google spreadsheet using requests in Python?

I am doing a project that asks me to obtain 3 databases from Google Sheets using the requests library (I do the processing afterwards). The problem is that when I GET the url and apply .text, .json, or .content, it gives me the structured markup of the entire spreadsheet, but I want the values of the rows and columns. Any ideas?
Here are the spreadsheets:
https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423
https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit
https://docs.google.com/spreadsheets/d/1udwn61l_FZsFsEuU8CMVkvU2SpwPW3Krt1OML3cYMYk/edit
The best way of getting data from a Google spreadsheet in Python is gspread, a Python API for Google Sheets.
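A minimal gspread sketch might look like this (it assumes you have service-account credentials saved as creds.json, a placeholder filename, and that the service account has been granted access to the sheet):
import gspread

# authenticate with a service account (creds.json is a placeholder path)
gc = gspread.service_account(filename='creds.json')
sh = gc.open_by_url('https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit')

# get_all_values returns the first worksheet as a list of rows (lists of strings)
rows = sh.sheet1.get_all_values()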
However, there are alternatives if you aren't the owner of the spreadsheet (or you just want to do it another way as an exercise). For instance, you can do it using the requests and bs4 modules, as you can see in this answer.
Applied to your specific case, the code would look like this ("Datos Argentina - Salas de Cine" spreadsheet):
import typing
import requests
from bs4 import BeautifulSoup
def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    html = requests.get('https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all("td")] for row in salas_cine.find_all('tr')]
    return rows
Important note: with the link provided (and the code above) you will only be able to get the first 100 rows of data!
This can be fixed in more than one way. What I've tried is modifying the url of the spreadsheet to display the data as a simple html table (reference).
Old URL: https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/edit#gid=1691373423
New URL (remove edit#gid=1691373423 and append gviz/tq?tqx=out:html&tq&gid=1): https://docs.google.com/spreadsheets/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1
Now you are able to obtain all the rows that the spreadsheet contains:
def scrapeDataFromSpreadsheet() -> typing.List[typing.List[str]]:
    html = requests.get('https://docs.google.com/spreadsheets/u/0/d/1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/gviz/tq?tqx=out:html&tq&gid=1').text
    soup = BeautifulSoup(html, 'lxml')
    salas_cine = soup.find_all('table')[0]
    rows = [[td.text for td in row.find_all("td")] for row in salas_cine.find_all('tr')]
    return rows
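A further variation on the same trick, assuming the sheet stays public: asking the gviz endpoint for CSV output (tqx=out:csv) lets pandas read the sheet directly, with no BeautifulSoup step at all:
import pandas as pd

# same spreadsheet, but requesting CSV instead of HTML from the gviz endpoint
csv_url = ('https://docs.google.com/spreadsheets/d/'
           '1o8QeMOKWm4VeZ9VecgnL8BWaOlX5kdCDkXoAph37sQM/'
           'gviz/tq?tqx=out:csv&gid=1')
df = pd.read_csv(csv_url)
print(df.head())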

How to fetch text data from a website and store it as an Excel file using Python

I want to create a script that fetches all the data from the following website, https://www.bis.doc.gov/dpl/dpl.txt, stores it in an Excel file, and counts the number of records in it, using Python. I've tried to achieve this by implementing the code as:
import requests
import re
from bs4 import BeautifulSoup
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
soup = BeautifulSoup(page.text, "lxml")
print(soup)
I've fetched the data but don't know the next step for storing it as an Excel file. Please guide me or share your ideas. Thank you in advance!
You can do it easily with pandas, since the data is tab-separated values.
Note: openpyxl needs to be installed for the Excel export to work.
import requests
import io
import pandas as pd
URL = "https://www.bis.doc.gov/dpl/dpl.txt"
page = requests.get(URL)
df = pd.read_csv(io.StringIO(page.text), sep="\t")
df.to_excel(r'i_data.xlsx', index = False)
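The question also asks for a record count; one extra line covers it (assuming one record per data row):
# number of records = number of data rows (the header is not counted)
print(len(df))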

Scrape text from a list of different URLs using Python

I have a list of different URLs which I would like to scrape the text from using Python. So far I've managed to build a script that returns URLs based on a Google Search with keywords; however, I would now like to scrape the content of these URLs. The problem is that I'm currently scraping the ENTIRE website, including the layout/style info, while I only want to scrape the 'visible text'. Ultimately, my goal is to scrape the names from all these URLs and store them in a pandas DataFrame, perhaps even including how often certain names are mentioned, but that is for later. Below is a rather simple start of my code so far:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup
import requests
from time import sleep
from random import randint
import spacy
import en_core_web_sm
import pandas as pd

url_list = ["https://www.nhtsa.gov/winter-driving-safety",
            "https://www.safetravelusa.com/",
            "https://www.theatlantic.com/business/archive/2014/01/how-2-inches-of-snow-created-a-traffic-nightmare-in-atlanta/283434/",
            "https://www.wsdot.com/traffic/passes/stevens/"]
df = pd.DataFrame(url_list, columns=['url'])
df_Names = []

# load English language model
nlp = en_core_web_sm.load()

# find names in text
def spacy_entity(df):
    df1 = nlp(df)
    df2 = [[w.text, w.label_] for w in df1.ents]
    return df2

for index, url in df.iterrows():
    print(index)
    print(url)
    sleep(randint(2, 5))
    req = Request(url[0], headers={"User-Agent": 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    soup = BeautifulSoup(webpage, 'html5lib').get_text()
    df_Names.append(spacy_entity(soup))

df["Names"] = df_Names
For getting the visible text with BeautifoulSoup, there is already this answer:
BeautifulSoup Grab Visible Webpage Text
Once you get your visible text, if you want to extract "names" (I'm assuming by names here you mean "nouns"), you can check nltk package (or Blob) on this other answer: Extracting all Nouns from a text file using nltk
Once you apply both, you can ingest your outputs into a pandas DataFrame.
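For reference, a minimal sketch of the visible-text step along the lines of the linked answer (the exact list of tags to strip is an assumption; the linked answer handles more cases):
from bs4 import BeautifulSoup

def visible_text(html):
    soup = BeautifulSoup(html, 'html5lib')
    # remove elements whose text is never rendered on the page
    for tag in soup(['script', 'style', 'head', 'title', 'meta']):
        tag.decompose()
    # collapse the remaining text into a single space-separated string
    return ' '.join(soup.stripped_strings)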
Note: please be aware that extracting the visible text from HTML is still an open problem. These two papers frame the problem far better than I can, and both use machine learning techniques: https://arxiv.org/abs/1801.02607, https://dl.acm.org/doi/abs/10.1145/3366424.3383547.
And their respective GitHub repositories: https://github.com/dalab/web2text, https://github.com/mrjleo/boilernet

Unable to parse HTML table with Beautiful Soup

I am very new to using Beautiful Soup and I'm trying to import data from the URL below as a pandas dataframe.
However, the final result has the correct columns names, but no numbers for the rows.
What should I be doing instead?
Here is my code:
from bs4 import BeautifulSoup
import pandas as pd  # needed for pd.read_html below
import requests

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)
The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to csv:
import json
import requests
import pandas as pd
data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')
This saves the table to data.csv.
The website you're trying to scrape renders the table values dynamically, and requests.get only returns the HTML the server sends before any JavaScript runs.
You will have to find an alternative way of accessing the data or render the webpage's JS (see this example).
A common way of doing this is to use selenium to automate a browser which allows you to render the JavaScript and get the source code that way.
Here is a quick example:
import time
import pandas as pd
from selenium.webdriver import Chrome

# request the dynamically loaded page source
c = Chrome(r'/path/to/webdriver.exe')
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')

# wait for it to render in the browser
time.sleep(5)
html_data = c.page_source

# load into pd.DataFrame
tables = pd.read_html(html_data)
df = tables[0]
df.columns = df.columns.droplevel()  # convert the MultiIndex to an Index

Note that I didn't use BeautifulSoup; you can pass the html directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or search for a way to access the data as JSON or CSV from elsewhere and use that.
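For completeness, a rough sketch of the requests-html route (untested against this particular site; render() downloads Chromium on first use and executes the page's JavaScript):
from requests_html import HTMLSession
import pandas as pd

session = HTMLSession()
r = session.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')
r.html.render()  # run the page's JavaScript in a headless browser
tables = pd.read_html(r.html.html)
df = tables[0]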

Scraping HTML tables to CSVs using BS4 for use with Pandas

I have begun a pet-project creating what is essentially an indexed compilation of a plethora of NFL statistics with a nice simple GUI. Fortunately, the site https://www.pro-football-reference.com has all the data you can imagine in the form of tables which can be exported to CSV format on the site and manually copied/pasted. I started doing this, and then using the Pandas library, began reading the CSVs into DataFrames to make use of the data.
This works great; however, manually fetching all this data is quite tedious, so I decided to attempt to create a web scraper that can scrape HTML tables and convert them into a usable CSV format. I am struggling, specifically with isolating individual tables, but also with getting the produced CSV to render in a readable/usable format.
Here is what the scraper looks like right now:
from bs4 import BeautifulSoup
import requests
import csv
def table_Scrape():
    url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.select_one('table.stats_table')
    headers = [th.text.encode("utf-8") for th in table.select("tr th")]
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        wr.writerows([
            [td.text.encode("utf-8") for td in row.find_all("td")]
            for row in table.select("tr + tr")
        ])

table_Scrape()
This does properly send the request to the URL, but doesn't fetch the data I am looking for, which is 'Rushing_and_Receiving'. Instead, it fetches the first table on the page, 'Team Stats and Ranking'. It also renders the CSV in a rather ugly/not useful format, like so:
b'',b'',b'',b'Tot Yds & TO',b'',b'',b'Passing',b'Rushing',b'Penalties',b'',b'Average Drive',b'Player',b'PF',b'Yds',b'Ply',b'Y/P',b'TO',b'FL',b'1stD',b'Cmp',b'Att',b'Yds',b'TD',b'Int',b'NY/A',b'1stD',b'Att',b'Yds',b'TD',b'Y/A',b'1stD',b'Pen',b'Yds',b'1stPy',b'#Dr',b'Sc%',b'TO%',b'Start',b'Time',b'Plays',b'Yds',b'Pts',b'Team Stats',b'Opp. Stats',b'Lg Rank Offense',b'Lg Rank Defense'
b'309',b'4944',b'920',b'5.4',b'22',b'8',b'268',b'288',b'474',b'3222',b'27',b'14',b'6.4',b'176',b'415',b'1722',b'8',b'4.1',b'78',b'81',b'636',b'14',b'170',b'30.6',b'12.9',b'Own 27.8',b'2:38',b'5.5',b'29.1',b'1.74'
b'8',b'5',b'',b'',b'8',b'13',b'1',b'',b'12',b'12',b'13',b'5',b'13',b'',b'4',b'6',b'4',b'7',b'',b'',b'',b'',b'',b'1',b'21',b'2',b'3',b'2',b'5',b'4'
b'8',b'10',b'',b'',b'20',b'20',b'7',b'',b'7',b'11',b'31',b'15',b'21',b'',b'11',b'15',b'4',b'15',b'',b'',b'',b'',b'',b'24',b'16',b'5',b'13',b'14',b'15',b'11'
I know my issue with fetching the correct table lies within the line:
table = soup.select_one('table.stats_table')
I am still what I would consider a novice in Python, so if someone can help me query and parse a specific table with BS4 into CSV format, I would be beyond appreciative!
Thanks in advance!
The pandas solution didn't work for me due to the AJAX load, but you can see in the console the URL each table is loading from and request it directly. In this case, the URL is: https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving
You can then get the table directly using its id rushing_and_receiving.
This seems to work.
from bs4 import BeautifulSoup
import requests
import csv
def table_Scrape():
    url = 'https://widgets.sports-reference.com/wg.fcgi?css=1&site=pfr&url=%2Fteams%2Fnwe%2F2008.htm&div=div_rushing_and_receiving'
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    table = soup.find('table', id='rushing_and_receiving')
    headers = [th.text for th in table.findAll("tr")[1]]
    body = table.find('tbody')
    with open("out.csv", "w", encoding='utf-8') as f:
        wr = csv.writer(f)
        wr.writerow(headers)
        for data_row in body.findAll("tr"):
            th = data_row.find('th')
            wr.writerow([th.text] + [td.text for td in data_row.findAll("td")])

table_Scrape()
I would bypass Beautiful Soup altogether, since pandas works well for this site (at least for the first 4 tables I glanced over).
The documentation for pandas.read_html is here.
import pandas as pd
url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
data = pd.read_html(url)
# data is now a list of dataframes (spreadsheets) one dataframe for each table in the page
data[0].to_csv('somefile.csv')
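If you want every table on the page in its own file, a small extension of the same idea (the file naming is just illustrative):
import pandas as pd

url = 'https://www.pro-football-reference.com/teams/nwe/2008.htm'
tables = pd.read_html(url)

# write each table on the page to its own numbered CSV
for i, table in enumerate(tables):
    table.to_csv(f'table_{i}.csv', index=False)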
I wish I could credit both of these answers as correct, as they are both useful, but alas, the second answer using BeautifulSoup is the better answer since it allows for the isolation of specific tables, whereas the nature of the way the site is structured limits the effectiveness of the 'read_html' method in Pandas.
Thanks to everyone who responded!
