I'm trying to get a table that is nested several levels deep on the page.
I'm new to BeautifulSoup and I have only practiced on some simple examples.
The issue is that I can't understand why my code can't get the "div" tag that has the class "Explorer is-embed", because from that point I could go deeper to reach the tbody where all the data I want to scrape is located.
Thanks for your help in advance.
Below is my code:
import requests
from bs4 import BeautifulSoup

url = "https://ourworldindata.org/covid-cases"
url_content = requests.get(url)
soup = BeautifulSoup(url_content.text, "lxml")
########################
div1 = soup.body.find_all("div", attrs={"class": "content-wrapper"})
div2 = div1[0].find_all("div", attrs={"class": "offset-content"})
sections = div2[0].find_all('section')
figure = sections[1].find_all("figure")
div3 = figure[0].find_all("div")
div4 = div3[0].find_all("div")
Here is a snapshot of the "div" tag that I'm not getting.
The data is loaded dynamically, so it isn't in the static HTML that requests receives. Instead, grab the public source csv (other formats are available):
https://ourworldindata.org/coronavirus-source-data
import pandas as pd
df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')
df.head()
The values you see in the Daily new confirmed COVID-19 cases (per 1M) table are calculated from the same data as in that file, for the two dates being compared.
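For example, a minimal sketch of pulling those per-million values straight from the csv (the column names location, date, new_cases and new_cases_per_million are my assumption about the current file; check df.columns if they have changed):

import pandas as pd

df = pd.read_csv('https://covid.ourworldindata.org/data/owid-covid-data.csv')

# Daily new confirmed cases per million for the two dates being compared, e.g. for Germany
mask = (df['location'] == 'Germany') & (df['date'].isin(['2021-03-01', '2021-03-08']))
print(df.loc[mask, ['location', 'date', 'new_cases', 'new_cases_per_million']])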
I am learning Python and I am trying to create a function to web scrape tables of vaccination rates from several different web pages in a GitHub repository for Our World in Data (https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data; see also https://ourworldindata.org/about). The code works perfectly when web scraping a single table and saving it into a data frame...
import requests
from bs4 import BeautifulSoup
import pandas as pd
url = "https://github.com/owid/covid-19-data/blob/master/public/data/vaccinations/country_data/Bangladesh.csv"
response = requests.get(url)
response
scraping_html_table_BD = BeautifulSoup(response.content, "lxml")
scraping_html_table_BD = scraping_html_table_BD.find_all("table", "js-csv-data csv-data js-file-line-container")
df = pd.read_html(str(scraping_html_table_BD))
BD_df = df[0]
But I have not had much luck when trying to create a function to scrape several pages. I have been following the tutorial on this website 3, in the section 'Scrape multiple pages with one script', and StackOverflow questions like 4 and 5, amongst other pages. I tried creating a global variable first, but I end up with errors like "RecursionError: maximum recursion depth exceeded while calling a Python object". This is the best code I have managed, as it doesn't generate an error, but I've not managed to save the output to a global variable. I really appreciate your help.
import pandas as pd
from bs4 import BeautifulSoup
import requests

link_list = ['/Bangladesh.csv',
             '/Nepal.csv',
             '/Mongolia.csv']

def get_info(page_url):
    page = requests.get('https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data' + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    vaccination_rates = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    result = {}
    df = pd.read_html(str(vaccination_rates))
    vaccination_rates = df[0]
    df = pd.DataFrame(vaccination_rates)
    print(df)
    df.to_csv("testdata.csv", index=False)

for link in link_list:
    get_info(link)
Edit: I can view the data for the final webpage in the iteration, since that is what ends up saved to the csv file, but not the data from the preceding links.
new = pd.read_csv('testdata6.csv')
pd.set_option("display.max_rows", None, "display.max_columns", None)
new
This is because on every iteration your 'testdata.csv' is overwritten with a new one.
So you can do:
df.to_csv(page_url[1:], index=False)
I'm guessing you're overwriting your 'testdata.csv' each time, which is why you can only see the final page. I would either add an enumerate call to give each scraped page its own csv identifier, e.g.:
def get_info(page_url, key):
    ...
    df.to_csv(f"testdata{key}.csv", index=False)

for key, link in enumerate(link_list):
    get_info(link, key)
Or, open this csv as part of your get_info function; the steps for that are available in append new row to old csv file python.
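For example, a minimal sketch of that append approach, reusing your selectors (the write-header-only-on-first-pass behaviour and the filename are my assumptions, and it assumes the GitHub page still serves the table as static HTML):

import os
import pandas as pd
import requests
from bs4 import BeautifulSoup

base = 'https://github.com/owid/covid-19-data/tree/master/public/data/vaccinations/country_data'

def get_info(page_url, out_file='testdata.csv'):
    page = requests.get(base + page_url)
    scape = BeautifulSoup(page.text, 'html.parser')
    tables = scape.find_all("table", "js-csv-data csv-data js-file-line-container")
    df = pd.read_html(str(tables))[0]
    # Append each page's rows; write the header only if the file doesn't exist yet
    df.to_csv(out_file, mode='a', index=False, header=not os.path.exists(out_file))

for link in ['/Bangladesh.csv', '/Nepal.csv', '/Mongolia.csv']:
    get_info(link)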
(Disclaimer: I'm a Python and web scraping noob, but I'm doing my best to learn).
I'm trying to extract 3 key data points from research studies on clinicaltrials.gov. They have an API, but the API doesn't capture the things I need. I want to get a (1) short description of the study, (2) the Principal Investigator (PI), and (3) some keywords associated with the study. I believe my code captures 1 and 3, but not 2. I can't seem to figure out why I'm not getting the Principal Investigator(s) name. Here are the two sites I have in my code:
https://clinicaltrials.gov/ct2/show/NCT03530579
https://clinicaltrials.gov/ct2/show/NCT03436992
Here's my code (I know the PI code is wrong, but I wanted to demonstrate that I tried):
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
fields = ['PI', 'Project_Summary', 'Keywords']
with open(r'test.csv', 'a') as f:
    writer = csv.writer(f)
    writer.writerow(fields)

urls = ['https://clinicaltrials.gov/ct2/show/NCT03436992', 'https://clinicaltrials.gov/ct2/show/NCT03530579']

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # get keywords
    for rows in soup.find_all("td"):
        k = rows.get_text()
        Keywords = k.strip()

    # get Principal Investigator
    PI = soup.find_all('padding:1ex 1em 0px 0px;white-space:nowrap;')

    # Get description
    Description = soup.find(class_='ct-body3 tr-indent2').get_text()

    d = {'Summary2': [PI, Description, Keywords]}
    df = pd.DataFrame(d)
    print(df)

    import csv
    fields = [PI, Description, Keywords]
    with open(r'test.csv', 'a') as f:
        writer = csv.writer(f)
        writer.writerow(fields)
You may be able to use the following selector
i.e. PI = soup.select_one('.tr-table_cover [headers=name]').text
import requests
from bs4 import BeautifulSoup

urls = ['https://clinicaltrials.gov/ct2/show/NCT03530579', 'https://clinicaltrials.gov/ct2/show/NCT03436992', 'https://clinicaltrials.gov/show/NCT03834376']

with requests.Session() as s:
    for url in urls:
        r = s.get(url)
        soup = BeautifulSoup(r.text, "lxml")
        item = soup.select_one('.tr-table_cover [headers=name]').text if soup.select_one('.tr-table_cover [headers=name]') is not None else 'No PI'
        print(item)
The . is a class selector and the [] is an attribute selector. The space between them is a descendant combinator, specifying that the element retrieved on the right is a descendant of the one on the left.
I simply used pandas to get the tables. This returns a list of dataframes, which you can then iterate through to look for the PI:
tables = pd.read_html(url)
for table in tables:
    try:
        if 'Principal Investigator' in table.iloc[0, 0]:
            pi = table.iloc[0, 1]
    except:
        continue
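For example, a rough sketch of applying that over the question's URLs (this assumes the pages still serve their tables as static HTML):

import pandas as pd

urls = ['https://clinicaltrials.gov/ct2/show/NCT03436992',
        'https://clinicaltrials.gov/ct2/show/NCT03530579']

for url in urls:
    pi = 'No PI'
    for table in pd.read_html(url):
        try:
            # The investigator table carries 'Principal Investigator' in its first cell
            if 'Principal Investigator' in table.iloc[0, 0]:
                pi = table.iloc[0, 1]
        except (IndexError, TypeError):
            continue
    print(url, pi)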
There are many ways to walk down the DOM tree, and your way is very "brittle": the selectors you chose as a starting point are extremely specific and bound to CSS styling, which can change far more easily than the structure of the document as a whole.
If I were you, I'd begin by filtering the nodes down on some criteria and then focus on that specific group while sifting through the noise.
Looking at the urls you showed, the data appears to be structured neatly, using tables. Based on that, we can make some assumptions, like:
It is data that is inside a table
It will contain the "principal investigator" string inside of it
# get all the tables in the page
tables = soup.find_all('table')
# now filter down to a smaller set of tables that might contain the info
refined_tables = [table for table in tables if 'principal investigator' in str(table).lower()]
At this point we have a strong candidate in our refined_tables list that might actually contain our primary table; ideally it is of size 1, assuming the "principal investigator" filter we used doesn't match anywhere inside the other tables.
principal_investigator = [ele for ele in refined_tables[0].find_all('td') if 'name' in ele.attrs.get('headers', [])][0].text
What was done here, from looking at the site, is that the page uses the headers attribute to assign a role to each td tag within the table row.
So in essence, just think of it from a top level and start narrowing down things as much as possible in simple steps to help you find what you're looking for.
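Putting those steps together, a minimal end-to-end sketch (assuming the page still tags the investigator cell with headers="name"):

import requests
from bs4 import BeautifulSoup

url = 'https://clinicaltrials.gov/ct2/show/NCT03530579'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Step 1: every table -> only tables that mention the investigator
tables = soup.find_all('table')
refined_tables = [t for t in tables if 'principal investigator' in str(t).lower()]

# Step 2: inside the candidate table, the name cell carries headers="name"
if refined_tables:
    cells = [td for td in refined_tables[0].find_all('td')
             if 'name' in td.attrs.get('headers', [])]
    if cells:
        print(cells[0].text.strip())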
I'm currently working on a web scraper that will let me pull stats for a football player. Usually this would be an easy task if I could just grab the divs; however, this website uses an attribute called data-stat and uses it like a class. This is an example of that:
<th scope="row" class="left " data-stat="year_id">2000</th>
If you would like to check the site for yourself here is the link.
https://www.pro-football-reference.com/players/B/BradTo00.htm
I've tried a few different methods. Either it won't work at all, or I can start a for loop and begin putting things into arrays; however, you will notice that not everything in the table is the same type.
Sorry for the formatting and the grammar.
Here is what I have so far. I'm sure it's not the best looking code; it's mainly code I've tried on my own with a few things mixed in from searching on Google. Ignore the random imports, I was trying different things.
# import libraries
import csv
from datetime import datetime
import requests
from bs4 import BeautifulSoup
import lxml.html as lh
import pandas as pd
# specify url
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
# request html
page = requests.get(url)
# Parse the html using BeautifulSoup with the lxml parser
soup = BeautifulSoup(page.content, 'lxml')
# find returns the first element that has the given class attribute
headers = [c.get_text() for c in soup.find(class_='table_container').find_all('td')[0:31]]
data = [[cell.get_text(strip=True) for cell in row.find_all('td')[0:32]]
        for row in soup.find_all("tr", class_=True)]
tags = soup.find(data ='pos')
#stats = tags.find_all('td')
print(tags)
You need to use the get method from BeautifulSoup to get the attributes by name
See: BeautifulSoup Get Attribute
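For instance, a quick sketch of both filtering on and reading back that attribute (hyphenated names like data-stat can't be passed as keyword arguments, so they go through attrs or a CSS selector):

import requests
from bs4 import BeautifulSoup

url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# Filter elements by their data-stat value via the attrs dict...
years = soup.find_all(attrs={'data-stat': 'year_id'})
# ...or with an equivalent CSS attribute selector
positions = soup.select('td[data-stat="pos"]')

for cell in years:
    print(cell.get('data-stat'), cell.get_text(strip=True))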
Here is a snippet to get all the data you want from the table:
from bs4 import BeautifulSoup
import requests

url = "https://www.pro-football-reference.com/players/B/BradTo00.htm"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# Get table
table = soup.find(class_="table_outer_container")

# Get head
thead = table.find('thead')
th_head = thead.find_all('th')
for thh in th_head:
    # Get case value
    print(thh.get_text())
    # Get data-stat value
    print(thh.get('data-stat'))

# Get body
tbody = table.find('tbody')
tr_body = tbody.find_all('tr')
for trb in tr_body:
    # Get id
    print(trb.get('id'))
    # Get th data
    th = trb.find('th')
    print(th.get_text())
    print(th.get('data-stat'))
    for td in trb.find_all('td'):
        # Get case value
        print(td.get_text())
        # Get data-stat value
        print(td.get('data-stat'))

# Get footer
tfoot = table.find('tfoot')
thf = tfoot.find('th')
# Get case value
print(thf.get_text())
# Get data-stat value
print(thf.get('data-stat'))
for tdf in tfoot.find_all('td'):
    # Get case value
    print(tdf.get_text())
    # Get data-stat value
    print(tdf.get('data-stat'))
You can of course save the data to a csv or even a json file instead of printing it.
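For example, a rough sketch of collecting each body row into a dict keyed by its data-stat values and writing a csv (this reuses the tbody variable from the snippet above; the output filename is my own choice):

import csv

rows = []
for trb in tbody.find_all('tr'):
    th = trb.find('th')
    if th is None:
        continue  # skip any spacer/header rows inside the body
    row = {th.get('data-stat'): th.get_text(strip=True)}
    for td in trb.find_all('td'):
        row[td.get('data-stat')] = td.get_text(strip=True)
    rows.append(row)

if rows:
    # Preserve first-seen column order across all rows
    fieldnames = list(dict.fromkeys(k for r in rows for k in r))
    with open('stats.csv', 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval='')
        writer.writeheader()
        writer.writerows(rows)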
It's not very clear what exactly you're trying to extract, but this might help you a little bit:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.pro-football-reference.com/players/B/BradTo00.htm'
page = requests.get(url)
soup = bs(page.text, "html.parser")
# Extract all tables
table = soup.find_all('table')
# Let's extract the cell data from each table
for row in table:
    col = row.find_all('td')
    for c in col:
        print(c.text)
Hope this helps!
I'm trying to extract information from a single web page that contains multiple similarly structured records. The information is contained within div tags with different classes (I'm interested in the username, main text and date). Here is the code I use:
import bs4 as bs
import urllib.request
import pandas as pd
href = 'https://example.ru/'
sause = urllib.request.urlopen(href).read()
soup = bs.BeautifulSoup(sause, 'lxml')
user = pd.Series(soup.find_all('div', class_='Username'))
main_text = pd.Series(soup.find_all('div', class_='MainText'))
date = pd.Series(soup.find_all('div', class_='Date'))
result = pd.DataFrame()
result = pd.concat([user, main_text, date], axis=1)
The problem is that I receive the information with all the tags, while I want only the text. Surprisingly, the .text attribute doesn't work with the find_all method, so now I'm completely out of ideas.
Thank you for any help!
A list comprehension is the way to go. To get all the text within MainText, for example, try:
[elem.text for elem in soup.find_all('div', class_='MainText')]
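Applied to all three fields, a small sketch of rebuilding the frame with plain text (same structure as your original code, just with .text taken per element; the column names are my own):

import pandas as pd

user = pd.Series([elem.text for elem in soup.find_all('div', class_='Username')])
main_text = pd.Series([elem.text for elem in soup.find_all('div', class_='MainText')])
date = pd.Series([elem.text for elem in soup.find_all('div', class_='Date')])

result = pd.concat([user, main_text, date], axis=1)
result.columns = ['user', 'main_text', 'date']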
I need to get the href links that the contents point to under a particular column from a table on Wikipedia. The page is http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015. On this page there are a few tables with the class "wikitable". For each row, I need the link that the content under the Title column points to, and I would like them copied onto an Excel sheet.
I do not know the exact code for searching under a particular column, but I got this far and I am getting a "NoneType object is not callable" error. I am using bs4. I wanted to extract at least some part of the table so I could figure out how to narrow down to the href links under the Title column, but I am ending up with this error. The code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup
soup = BeautifulSoup(urlopen('http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015').read())
for row in soup('table', {'class': 'wikitable'})[1].tbody('tr'):
    tds = row('td')
    print(tds[0].string, tds[0].string)
A little guidance would be appreciated. Does anyone know?
I figured out that the NoneType error might be related to the table filtering. The corrected code is below:
from urllib.request import urlopen
from bs4 import BeautifulSoup, SoupStrainer

content = urlopen("http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015").read()
filter_tag = SoupStrainer("table", {"class": "wikitable"})
soup = BeautifulSoup(content, "html.parser", parse_only=filter_tag)
links = []
for sp in soup.find_all(align="center"):
    a_tag = sp('a')
    if a_tag:
        links.append(a_tag[0].get('href'))
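If it helps, here is a rough sketch of pulling just the Title-column links and writing them to an Excel file (it assumes the Title is the first td in each row, which rows spanned by month/date cells may violate, and that openpyxl is installed for to_excel):

import pandas as pd
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://en.wikipedia.org/wiki/List_of_Telugu_films_of_2015'
soup = BeautifulSoup(urlopen(url).read(), 'html.parser')

links = []
for table in soup.find_all('table', {'class': 'wikitable'}):
    for row in table.find_all('tr'):
        cells = row.find_all('td')
        if not cells:
            continue  # header rows contain only th cells
        a = cells[0].find('a', href=True)
        if a:
            links.append('https://en.wikipedia.org' + a['href'])

pd.DataFrame({'Title link': links}).to_excel('telugu_films_2015.xlsx', index=False)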