Why is my website scrape missing the intended table using Python?

I am trying to use this code to scrape information from Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)), specifically names of executives. The code I have here is only giving me the following output:
,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)
I am trying to get the names as well. Here is my current code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in
        soup.select("table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter")]
The very last line is the one in which I believe the problem exists. I have tried removing and adding code to the section "table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter" but keep getting the same result. I copied it straight from the site, so I'm not sure what I'm missing. If not this, is there something wrong with the rest of the code in that line? Thank you!

There's a simpler way to do it. Taking one of your urls at random, try this:
import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]
Output:
Office Name Party Date assumed office
0 Governor of Georgia Brian Kemp Republican January 14, 2019
1 Governor of Tennessee Bill Lee Republican January 15, 2019
2 Governor of Missouri Mike Parson Republican June 1, 2018
etc.
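The same idea extends to all four offices with a loop. A minimal sketch, assuming the officeholder table sits at index 4 on each page (that index comes from the Governor page and may need adjusting for the other URLs):
import pandas as pd

urls = ['https://ballotpedia.org/Governor_(state_executive_office)',
        'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)',
        'https://ballotpedia.org/Secretary_of_State_(state_executive_office)',
        'https://ballotpedia.org/Attorney_General_(state_executive_office)']

frames = {}
for url in urls:
    # read_html returns every table on the page; index 4 is an assumption
    frames[url.split('/')[-1]] = pd.read_html(url)[4]

print(frames['Governor_(state_executive_office)'].head())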

You could try to reach the table via selector:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
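That selector returns the whole table as one block of text. If you want the individual names rather than one long string, one option (a sketch, using the Name column that appears in the table's header) is to hand the matched table to pandas:
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    table = soup.select_one('#officeholder-table')
    # parse the matched table and keep only the officeholder names;
    # the 'Name' column label is taken from the table shown on the page
    names = pd.read_html(str(table))[0]['Name'].tolist()
    temp_dict[page.split('/')[-1]] = names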

Use the following CSS selector to find the table first, then use pandas read_html() to load it into a dataframe.
This will give you all the data in a single dataframe.
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
df1=pd.DataFrame()
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    # note: DataFrame.append was removed in pandas 2.0; use pd.concat([df1, df]) there
    df1 = df1.append(df, ignore_index=True)

print(df1)
If you want to fetch each dataframe individually, try this.
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    print(df)
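Since the original goal was the names of the executives, either variant can be narrowed to just that column. A small usage note, assuming the parsed table keeps the Name header shown on the page:
# from the combined dataframe
names = df1['Name'].tolist()

# or per page, inside the second loop
print(df['Name'].tolist())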

Related

How to scrape table data with th and td with BeautifulSoup?

I am new to programming and have been trying to practice web scraping. I found an example where one of the columns I wish to have in my output is part of the table header. I am able to extract all the table data I wish, but have been unable to get the Year dates to show.
from bs4 import BeautifulSoup  # this module helps in web scraping.
import requests  # this module helps us to download a web page
import pandas as pd

url = "https://en.wikipedia.org/wiki/World_population"
data = requests.get(url).text
soup = BeautifulSoup(data, "html.parser")
tables = soup.find_all('table')
len(tables)

for index, table in enumerate(tables):
    if ("Global annual population growth" in str(table)):
        table_index = index
print(table_index)
print(tables[table_index].prettify())

population_data = pd.DataFrame(columns=["Year", "Population", "Growth"])
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if (col != []):
        Population = col[0].text.strip()
        Growth = col[1].text.strip()
        population_data = population_data.append({"Population": Population, "Growth": Growth}, ignore_index=True)
population_data
You could use pandas directly here to reach your goal: pandas.read_html() to scrape the table and DataFrame.T to transpose it:
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/World_population')[0].T.reset_index()
df.columns = df.loc[0]
df = df[1:]
df
or same result with BeautifulSoup and stripped_strings:
import requests
import pandas as pd
from bs4 import BeautifulSoup
soup = BeautifulSoup(requests.get('https://en.wikipedia.org/wiki/World_population').text)
pd.DataFrame(
    {list(e.stripped_strings)[0]: list(e.stripped_strings)[1:] for e in soup.table.select('tr')}
)
Output:

Population  Year  Years elapsed
1           1804  200,000+
2           1930  126
3           1960  30
4           1974  14
5           1987  13
6           1999  12
7           2011  12
8           2022  11
9           2037  15
10          2057  20
Actually, it's because you are only scraping <td> cells in this line:
col = row.find_all('td')
If you take a look at each <tr> in the developer tools (F12), you can see that the table also contains a <th> tag, which holds the year and which you are not scraping. So all you have to do is add this line inside the if condition:
year = row.find('th').text
and after that you can append it to population_data.
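A minimal sketch of how that loop might look with the <th> included (keeping the question's DataFrame.append call, which was removed in pandas 2.0 in favour of pd.concat):
for row in tables[table_index].tbody.find_all('tr'):
    col = row.find_all('td')
    if col != []:
        year = row.find('th').text.strip()  # the year lives in the <th> cell
        population = col[0].text.strip()
        growth = col[1].text.strip()
        population_data = population_data.append(
            {"Year": year, "Population": population, "Growth": growth},
            ignore_index=True)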

How to loop through scraped items and add them to a dictionary or pandas dataframe?

For a project I'm scraping data from futbin players and I would like to add that scraped data to a dict or pandas dataframe. I've been stuck for a couple of hours and would like some help if possible. I will put my code below showing what I have so far. This piece of code only prints out the data, and from there I'm clueless about what to do.
Code:
from requests_html import HTMLSession
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    info = soup.find('div', id='info_content')
    rows = info.find_all('td')
    for info in rows:
        print(info.text.strip())
The work you have already done to identify the table you want is good.

- use read_html() to convert it to a dataframe
- basic transforms to turn it into columns rather than key/value pairs
- a list comprehension to get details of all wanted footballers
import requests
from bs4 import BeautifulSoup
import pandas as pd
urls = ['https://www.futbin.com/21/player/87/pele', 'https://www.futbin.com/21/player/27751/robert-lewandowski']
def myhtml(url):
    # use BS4 to get table that has required data
    html = str(BeautifulSoup(requests.get(url).content, 'html.parser').find('div', id='info_content').find("table"))
    # read_html() returns a list, take first one, first column are attribute name, transpose to build DF
    return pd.read_html(html)[0].set_index(0).T

df = pd.concat([myhtml(u) for u in urls])
Output:

Columns: Name, Club, Nation, League, Skills, Weak Foot, Intl. Rep, Foot, Height, Weight, Revision, Def. WR, Att. WR, Added on, Origin, R.Face, B.Type, DOB, Robert Lewandowski FIFA 21 Career Mode, Age

Row 1 (Pelé): Edson Arantes Nascimento, FUT 21 ICONS, Brazil, Icons, 5, 4, 5, Right, 173cm, 5'8", 70, Icon, Med, High, 2020-09-10, Prime, nan, Unique, 23-10-1940, nan
Row 1 (Lewandowski): Robert Lewandowski, FC Bayern, Poland, Bundesliga, 4, 4, 4, Right, 184cm, 6'0", 80, TOTY, Med, High, 2021-01-22, TOTY, nan, Unique, nan, Robert Lewandowski FIFA 21 Career Mode
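Since the question also mentions a dictionary as an acceptable target, the dataframe can be converted once it is built; a small usage note:
# one dict per player, keyed by the attribute names from the futbin table
records = df.to_dict('records')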
I would do it with open() and write():
file = open("filename.txt", "w")
The "w" specifies the following:
"w" - Write - Opens a file for writing, creates the file if it does not exist
And then:
file.write(text_to_save)
Be sure to include os.path!
import os.path
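Tied back to the question's loop, a minimal sketch of writing the scraped cell text to a file (the filename is just an example):
with open("player_info.txt", "w") as file:
    for info in rows:
        # one scraped table cell per line
        file.write(info.text.strip() + "\n")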

How to grab all tr elements from a table and click on a link?

I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# If you do not want to use requests, you can use urllib instead.
# It should not cause any issue.
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)

table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on the links of senators, like 'Richard Shelby', which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
Now, in order to get the data row by row, you would have to account for the cells with a "rowspan" which stretch across multiple rows. You can either follow the approaches suggested at What should I do when <tr> has rowspan, or the implementation I provide below (not ideal but works in your case).
import copy
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"

with requests.Session() as session:
    html = session.get(link).text
    soup = BeautifulSoup(html, "lxml")
    senators_table = soup.find(id='Senators').find_next("table")
    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]
    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']

                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
If you then want to follow the links to senator "profile" pages, you would first need to extract the link out of the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
There is also another way to get the tabular data parsed - .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.

Scraping Table using Python and Selenium

I am trying to scrape the table below using Python. I tried pulling the html tags to find the element id_dt1_NGY00 and so on, but cannot find them once the page is populated, so someone told me to use Selenium, and with it I did manage to scrape some data.
https://www.insidefutures.com/markets/data.php?page=quote&sym=ng&x=13&y=8
The numbers are updated every 10 minutes, so this website is dynamic. I used the code below, but it prints everything in a linear format rather than in a tabular format with rows and columns. Included below are two sections of sample output:
Contract
Last
Change
Open
High
Low
Volume
Prev. Stl.
Time
Links
May '21 (NGK21)
2.550s
+0.006
2.550
2.550
2.550
1
2.544
05/21/18
Q / C / O
Jun '21 (NGM21)
2.576s
+0.006
0.000
2.576
2.576
0
2.570
05/21/18
Q / C / O
Code below
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
browser = webdriver.Chrome(executable_path=r"C:\Users\siddk\PycharmProjects\WebSraping\venv\selenium\webdriver\chromedriver.exe")
browser.get("https://www.insidefutures.com/markets/data.php?page=quote&sym=ng&x=14&y=16")
html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
th_tags = soup.find_all('tr')
for th in th_tags:
    print(th.get_text())
I want to extract this data into pandas and analyze averages etc. on a daily basis. Please help; I have exhausted my strength doing this myself with multiple iterations of the code.
Try the script below to get the tabular data. The trick is to find the right url which contains the same table but is not generated dynamically, so that you can do your operation without using any browser simulator.
Give it a go:
from bs4 import BeautifulSoup
import requests
url = "https://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=ng&x=13&y=8&domain=if&display_ice=1&enabled_ice_exchanges=&tz=0&ed=0"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
for tr in soup.find(class_="bcQuoteTable").find_all("tr"):
    data = [item.get_text(strip=True) for item in tr.find_all(["th","td"])]
    print(data)
Results are like:
['Contract', 'Last', 'Change', 'Open', 'High', 'Low', 'Volume', 'Prev. Stl.', 'Time', 'Links']
['Cash (NGY00)', '2.770s', '+0.010', '0.000', '2.770', '2.770', '0', '2.760', '05/21/18', 'Q/C/O']
["Jun \\'18 (NGM18)", '2.901', '-0.007', '2.902', '2.903', '2.899', '138', '2.908', '17:11', 'Q/C/O']
["Jul \\'18 (NGN18)", '2.927', '-0.009', '2.928', '2.930', '2.926', '91', '2.936', '17:11', 'Q/C/O']
["Aug \\'18 (NGQ18)", '2.944', '-0.008', '2.945', '2.947', '2.944', '42', '2.952', '17:10', 'Q/C/O']

Web scraping a webpage for nested table using BeautifulSoup

I am trying to get some information from this page : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275
where I am particularly interested in extracting the Characteristics data as follows:
group_id: xxx
medicore_id: xxxxxxx
date_of_visit_sample_drawn_date: xxxxxxx
rin: xxxxxx
donor_id: xxxxx
sle_visit_designation: xxxxxxx
bold_shipment_batch: xxxxxx
rna_concentrated: xxxxxx
subject_type: xxxxxxx
so on and so forth.
Upon inspecting the page, I realize that this information is deeply nested within other larger tables and that there is no special class/id for me to effectively parse out the characteristics information.
I have been unsuccessfully trying to look for tables within tables, but I find that sometimes not all tables are being read. This is what I have so far:
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275").text
soup = BeautifulSoup(source, 'lxml')

table = soup.find_all('table')
for i in table:
    print i.prettify()
print (len(table))  # 22 tables

print (table[6].prettify())  # narrow down on relevant table
table = table[6]
table_subtables = table.find_all('table')
for i in table_subtables:
    print (i.prettify())
print len(table_subtables)  # 14 tables

tbb = table_subtables[1]
tbb_subtable = tbb.find_all('table')
for i in tbb_subtable:
    print (i.prettify())
print len(tbb_subtable)  # 12 tables

tbbb = tbb_subtable[5]
tbbb_subtable = tbbb.find_all('table')
for i in tbbb_subtable:
    print (i.prettify())
print len(tbbb_subtable)  # 6 tables
so on and so forth. However, as I keep doing this, I find that not all tables are being read. Can someone point me to a better solution?
You can scrape the data with regular expressions and urllib to specifically scrape the keywords and their corresponding values:
import re
import urllib
data = str(urllib.urlopen('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275').read())
target_vals = ['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']
final_data = {i:re.findall('(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}
Output:
{
 'date_of_visit_sample_drawn_date': '2009',
 'rna_concentrated': 'No',
 'sle_visit_designation': 'Baseline',
 'rin': '8',
 'subject_type': 'Patient',
 'donor_id': '19',
 'bold_shipment_batch': '1',
 'medicore_id': 'B0019V1',
 'group_id': 'A'
}
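One caveat with that pattern: \w+ stops at characters such as '-' and '.', which is why the date comes back as '2009' and rin as '8' rather than the full values on the page. If you need them in full, a slightly wider character class is one option (a sketch, assuming the values themselves contain no whitespace or '<'):
final_data = {i: re.findall(r'(?<={}:\s)[^\s<]+'.format(i), data)[0] for i in target_vals}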
Edit: given multiple links, you can create a pandas dataframe out of the generated data for each:
import re
import urllib
import pandas as pd
def get_data_from_links(link, target_vals=['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']):
    data = str(urllib.urlopen(link).read())
    return {i:re.findall('(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}
returned_data = get_data_from_links('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275')
df = pd.DataFrame([returned_data])
Output:
bold_shipment_batch date_of_visit_sample_drawn_date donor_id group_id \
0 1 2009 19 A
medicore_id rin rna_concentrated sle_visit_designation subject_type
0 B0019V1 8 No Baseline Patient
If you have a list of links you would like to retrieve your data from, you can construct a table by constructing a nested dictionary of the resulting data to pass to DataFrame.from_dict:
link_lists = ['link1', 'link2', 'link3']
final_data = {i:get_data_from_links(i) for i in link_lists}
new_table = pd.DataFrame.from_dict(final_data, orient='index')
Output (assuming the first link is 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275'):
rin rna_concentrated date_of_visit_sample_drawn_date \
link1 8 No 2009
sle_visit_designation bold_shipment_batch group_id subject_type \
link1 Baseline 1 A Patient
medicore_id donor_id
link1 B0019V1 19
The way Ajax1234 has shown in his solution is definitely the best way to go. However, if a hardcoded index is not a barrier and you wish to avoid regex, this is another approach you may think of trying:
from bs4 import BeautifulSoup
import requests
res = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275")
soup = BeautifulSoup(res.content, 'lxml')
for items in soup.select("td[style*='justify']")[2:3]:
    data = '\n'.join([item for item in items.strings][:9])
    print(data)
Output:
group_id: A
medicore_id: B0019V1
date_of_visit_sample_drawn_date: 2009-09-14
rin: 8.5
donor_id: 19
sle_visit_designation: Baseline
bold_shipment_batch: 1
rna_concentrated: No
subject_type: Patient
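If you want the same key/value structure as in the first answer, the printed block can be split into a dict; a small usage sketch, assuming every line keeps the 'key: value' shape shown above:
characteristics = dict(line.split(': ', 1) for line in data.splitlines())
print(characteristics['donor_id'])  # '19'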
