Web scraping a webpage for nested table using BeautifulSoup - python

I am trying to get some information from this page : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275
where I am particularly interested in extracting the Characteristics data as follows:
group_id: xxx
medicore_id: xxxxxxx
date_of_visit_sample_drawn_date: xxxxxxx
rin: xxxxxx
donor_id: xxxxx
sle_visit_designation: xxxxxxx
bold_shipment_batch: xxxxxx
rna_concentrated: xxxxxx
subject_type: xxxxxxx
so on and so forth.
Upon inspecting the page, I realize that this information is deeply nested within other larger tables and that there is no special class/id for me to effectively parse for the characteristics information.
I have been unsuccessfully trying to look for tables within tables, but I find that sometimes not all tables are being read. This is what I have so far:
from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275").text
soup = BeautifulSoup(source, 'lxml')

table = soup.find_all('table')
for i in table:
    print(i.prettify())
print(len(table))  # 22 tables

print(table[6].prettify())  # narrow down on relevant table
table = table[6]
table_subtables = table.find_all('table')
for i in table_subtables:
    print(i.prettify())
print(len(table_subtables))  # 14 tables

tbb = table_subtables[1]
tbb_subtable = tbb.find_all('table')
for i in tbb_subtable:
    print(i.prettify())
print(len(tbb_subtable))  # 12 tables

tbbb = tbb_subtable[5]
tbbb_subtable = tbbb.find_all('table')
for i in tbbb_subtable:
    print(i.prettify())
print(len(tbbb_subtable))  # 6 tables
So on and so forth. However, as I keep doing this, I find that not all tables are being read. Can someone point me to a better solution?

You can scrape the data with regular expressions and urllib to target the keywords and their corresponding values directly:
import re
from urllib.request import urlopen

data = urlopen('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275').read().decode('utf-8')
target_vals = ['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']
final_data = {i: re.findall(r'(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}
Output:
{
'date_of_visit_sample_drawn_date': '2009',
'rna_concentrated': 'No',
'sle_visit_designation': 'Baseline',
'rin': '8',
'subject_type': 'Patient',
'donor_id': '19',
'bold_shipment_batch': '1',
'medicore_id': 'B0019V1',
'group_id': 'A'
}
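Note that \w+ stops at characters such as - and ., which is why date_of_visit_sample_drawn_date and rin come back truncated above (the page itself shows 2009-09-14 and 8.5). A minimal hedged tweak, assuming these values only ever contain word characters, hyphens, and periods, is to widen the character class:
import re
from urllib.request import urlopen

data = urlopen('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275').read().decode('utf-8')
target_vals = ['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']
# [\w.-]+ also matches '.' and '-', so values such as '2009-09-14' and '8.5'
# survive intact (an assumption about the characters these fields may contain)
final_data = {i: re.findall(r'(?<={}:\s)[\w.-]+'.format(i), data)[0] for i in target_vals}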
Edit: given multiple links, you can create a pandas dataframe out of the generated data for each:
import re
from urllib.request import urlopen
import pandas as pd

def get_data_from_links(link, target_vals=['group_id', 'medicore_id', 'date_of_visit_sample_drawn_date', 'rin', 'donor_id', 'sle_visit_designation', 'bold_shipment_batch', 'rna_concentrated', 'subject_type']):
    data = urlopen(link).read().decode('utf-8')
    return {i: re.findall(r'(?<={}:\s)\w+'.format(i), data)[0] for i in target_vals}

returned_data = get_data_from_links('https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275')
df = pd.DataFrame([returned_data])
Output:
bold_shipment_batch date_of_visit_sample_drawn_date donor_id group_id \
0 1 2009 19 A
medicore_id rin rna_concentrated sle_visit_designation subject_type
0 B0019V1 8 No Baseline Patient
If you have a list of links you would like to retrieve data from, you can build a nested dictionary of the results and pass it to DataFrame.from_dict:
link_lists = ['link1', 'link2', 'link3']
final_data = {i:get_data_from_links(i) for i in link_lists}
new_table = pd.DataFrame.from_dict(final_data, orient='index')
Output (assuming the first link is 'https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275'):
rin rna_concentrated date_of_visit_sample_drawn_date \
link1 8 No 2009
sle_visit_designation bold_shipment_batch group_id subject_type \
link1 Baseline 1 A Patient
medicore_id donor_id
link1 B0019V1 19
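As a small hedged variant (assuming every link carries the same acc= query parameter as the example URL), you could index the rows by the GEO accession instead of the placeholder link names, reusing get_data_from_links from above:
from urllib.parse import urlparse, parse_qs

def accession(link):
    # pull the 'acc' query parameter, falling back to the link itself
    return parse_qs(urlparse(link).query).get('acc', [link])[0]

final_data = {accession(i): get_data_from_links(i) for i in link_lists}
new_table = pd.DataFrame.from_dict(final_data, orient='index')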

The way Ajax1234 has shown in his solution is definitely the best way to go. However, if a hardcoded index is not a barrier and you wish to avoid regex, this is another approach you may think of trying:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275")
soup = BeautifulSoup(res.content, 'lxml')
for items in soup.select("td[style*='justify']")[2:3]:
    data = '\n'.join([item for item in items.strings][:9])
    print(data)
Output:
group_id: A
medicore_id: B0019V1
date_of_visit_sample_drawn_date: 2009-09-14
rin: 8.5
donor_id: 19
sle_visit_designation: Baseline
bold_shipment_batch: 1
rna_concentrated: No
subject_type: Patient
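If you would rather end up with a dict than printed text, here is a minimal sketch (assuming each characteristic line keeps the key: value shape shown above) that splits the strings as it reads them:
from bs4 import BeautifulSoup
import requests

res = requests.get("https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2437275")
soup = BeautifulSoup(res.content, 'lxml')

characteristics = {}
for items in soup.select("td[style*='justify']")[2:3]:
    for line in items.strings:
        # keep only lines that look like "key: value"
        if ': ' in line:
            key, _, value = line.partition(': ')
            characteristics[key.strip()] = value.strip()
print(characteristics.get('rin'))  # expected '8.5', per the output above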

Related

How to narrow down the soup.find result and output only relevant text?

Every football player's Wikipedia page has something named an "infobox", where the career is displayed.
My goal is to scrape only the highlighted data from Wikipedia pages of football players.
I have gotten this far: I'm able to output the "infobox" segment of the player as text, like this. But the only information I want is the highlighted part.
How do I narrow the result so I only get the highlighted text as my output?
If you feel like you might know the answer, please ask questions if necessary, because I find it hard to formulate my question well.
The infobox table is a succession of <tr></tr> tags.
Broadly, we are looking for the <tr></tr> tag located immediately after the one whose text is "Seniorlag*".
You could do it like this:
import requests
from bs4 import BeautifulSoup

url = "https://sv.wikipedia.org/wiki/Lionel_Messi"
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
infobox = soup.find('table', {'class': 'infobox'})
tr_tags = infobox.find_all('tr')
for tr in tr_tags:
    if tr.text == "Seniorlag*":
        # Search for the following tr tag
        next_tr = tr.find_next_sibling('tr')
        print(next_tr.text)
Output:
År2003–20042004–20052004–20212021–
Klubb Barcelona C Barcelona B Barcelona Paris Saint-Germain
SM (GM) 10 00(5)22 00(6)520 (474)39 0(13)
Just in addition to the approach of @Vincent Lagache, which answers the question well, you could also deal with CSS selectors (more) to find your elements:
soup.select_one('tr:has(th:-soup-contains("Seniorlag")) + tr').text
Use a dict comprehension and stripped_strings to extract the strings:
{
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
This results in a dict that is already structured and can therefore be easily reused, for example to create a DataFrame:
{'År': ['2003–2004', '2004–2005', '2004–2021', '2021–'], 'Klubb': ['Barcelona C', 'Barcelona B', 'Barcelona', 'Paris Saint-Germain'], 'SM (GM)': ['10', '(5)', '22', '(6)', '520 (474)', '39', '(13)']}
Example
This example also includes some pre- and postprocessing steps, like decompose() to eliminate unwanted tags, and splits the column of tuples with pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://sv.wikipedia.org/wiki/Lionel_Messi'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# drop invisible helper elements so they do not pollute the strings
for hidden in soup.find_all(style='visibility:hidden;color:transparent;'):
    hidden.decompose()

d = {
    list(e.stripped_strings)[0]: list(e.stripped_strings)[1:]
    for e in soup.select('tr:has(th:-soup-contains("Seniorlag")) + tr table td')
}
d['SM (GM)'] = ' '.join(d['SM (GM)']).split()
d['SM (GM)'] = list(zip(d['SM (GM)'][0::2], d['SM (GM)'][1::2]))

df = pd.DataFrame(d)
df[['SM', 'GM']] = pd.DataFrame(df['SM (GM)'].tolist(), index=df.index)
df
Output
          År                Klubb           SM (GM)   SM     GM
0  2003–2004          Barcelona C     ('10', '(5)')   10    (5)
1  2004–2005          Barcelona B     ('22', '(6)')   22    (6)
2  2004–2021            Barcelona  ('520', '(474)')  520  (474)
3      2021–  Paris Saint-Germain    ('39', '(13)')   39   (13)

Why is my website scrape missing the intended table using python?

I am trying to use this code to scrape information from Ballotpedia (https://ballotpedia.org/Governor_(state_executive_office)), specifically names of executives. The code I have here is only giving me the following output:
,Governor_(state_executive_office),Lieutenant_Governor_(state_executive_office),Secretary_of_State_(state_executive_office),Attorney_General_(state_executive_office)
I am trying to get the names as well. Here is my current code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select("table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter")]
The very last line is the one in which I believe the problem exists. I have tried removing and adding code to the section "table.bptable.gray.sortable.tablesorter tablesorter-default tablesorter17e7f0d6cf4b4 jquery-tablesorter" but keep getting the same result. I copied it straight from the site, so I'm not sure what I'm missing. If not this, is there something wrong with the rest of the code in that line? Thank you!
There's a simpler way to do it. Taking one of your urls at random, try this:
import pandas as pd
tables = pd.read_html("https://ballotpedia.org/Governor_(state_executive_office)")
tables[4]
Output:
Office Name Party Date assumed office
0 Governor of Georgia Brian Kemp Republican January 14, 2019
1 Governor of Tennessee Bill Lee Republican January 15, 2019
2 Governor of Missouri Mike Parson Republican June 1, 2018
etc.
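As a hedged extension (the "Date assumed office" header is assumed from the single page tested above and may differ on the other pages), read_html's match argument can pick out the right table on each of the question's four URLs without hardcoding an index like tables[4]:
import pandas as pd

urls = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
# match= keeps only tables whose text matches the given string/regex
frames = {url.split('/')[-1]: pd.read_html(url, match='Date assumed office')[0] for url in urls}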
You could try to reach the table via selector:
import requests
from bs4 import BeautifulSoup
import pandas as pd
list = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
temp_dict = {}
for page in list:
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'html.parser')
    temp_dict[page.split('/')[-1]] = [item.text for item in soup.select('#officeholder-table')]
Use the following CSS selector to find the table first, then use pandas read_html() to load it into a dataframe.
This will give you all the data in a single dataframe.
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
df1 = pd.DataFrame()
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    df1 = df1.append(df, ignore_index=True)
print(df1)
If you want to fetch each dataframe individually, try this:
import pandas as pd
import requests
from bs4 import BeautifulSoup
listurl = ['https://ballotpedia.org/Governor_(state_executive_office)', 'https://ballotpedia.org/Lieutenant_Governor_(state_executive_office)', 'https://ballotpedia.org/Secretary_of_State_(state_executive_office)', 'https://ballotpedia.org/Attorney_General_(state_executive_office)']
for l in listurl:
    res = requests.get(l)
    soup = BeautifulSoup(res.text, 'html.parser')
    table = soup.select("table#officeholder-table")[-1]
    df = pd.read_html(str(table))[0]
    print(df)

How to pull specific columns from a Wikipedia table using python/Beautiful Soup

I've really been stumped for a while on this.
Link to table = https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons
I want to pull the data in the columns highlighted in red below
And put it in a pandas dataframe like this
Here is my code
import urllib.request
url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
page = urllib.request.urlopen(url)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page, "lxml")
# print(soup.prettify())
my_table = soup.find('table', {'class':'wikitable sortable'})
season = []
data = []
for row in my_table.find_all('tr'):
    s = row.find('th')
    season.append(s)
    d = row.find('td')
    data.append(d)
import pandas as pd
c = {'Season': season, 'Data': data}
df = pd.DataFrame(c)
df
Here's my output. I'm completely lost on how to get to the simple 5-column table above. Thanks
You are almost there, though you don't really need beautifulsoup for that; just pandas.
Try this:
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
resp = requests.get(url)
tables = pd.read_html(resp.text)
target = tables[2].iloc[:, [0, 2, 3, 4, 5]]
target
Output:
Season P W D L
Season League League League League
0 1886–87 NaN NaN NaN NaN
1 1888–89[9] 12 8 2 2
2 1889–90 22 9 2 11
etc. And you can take it from there.
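One hedged cleanup step to finish (the two-level header shown above, e.g. ('P', 'League'), is an assumption about how read_html parses this page): collapse the MultiIndex columns so you end up with the plain five-column table the question asks for:
import requests
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_Manchester_United_F.C._seasons"
tables = pd.read_html(requests.get(url).text)

target = tables[2].iloc[:, [0, 2, 3, 4, 5]].copy()
# the scraped table carries a two-level header, so flatten it
target.columns = ['Season', 'P', 'W', 'D', 'L']
print(target.head())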

How to grab all tr elements from a table and click on a link?

I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# If you do not want to use requests, you can fetch the page with urllib
# instead. It should not cause any issue.
soup = BeautifulSoup(html, "lxml")

res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)

table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on links of senators, like 'Richard Shelby' which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
Now, in order to get the data row by row, you would have to account for the cells with a "rowspan" which stretch across multiple rows. You can either follow the approaches suggested at What should I do when <tr> has rowspan, or the implementation I provide below (not ideal but works in your case).
import copy
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
with requests.Session() as session:
    html = session.get(link).text
    soup = BeautifulSoup(html, "lxml")
    senators_table = soup.find(id='Senators').find_next("table")
    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]
    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']

                # insert same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
If you then want to follow the links to senator "profile" pages, you would first need to extract the link from the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
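For the # TODO: parse step, here is a hedged sketch (Wikipedia infobox markup varies from page to page, so the label lookup below is an assumption) that pulls the "Assumed office" date the question asks for:
import requests
from bs4 import BeautifulSoup

# example profile page taken from the question
response = requests.get("https://en.wikipedia.org/wiki/Richard_Shelby")
soup = BeautifulSoup(response.content, "lxml")

infobox = soup.find("table", class_="infobox")
# the tenure is usually labelled "Assumed office" inside the infobox;
# the date is assumed to be the next text node after that label
label = infobox.find(string=lambda s: s and "Assumed office" in s)
if label:
    print("Assumed office:", label.find_next(string=True).strip())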
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
There is also another way to get the tabular data parsed: .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.

Scraping Table using Python and Selenium

I am trying to scrape the table below using Python. I tried pulling HTML tags to find the element id_dt1_NGY00 and so on, but cannot find them once the page is populated, so someone told me to use Selenium, and I did manage to scrape some data.
https://www.insidefutures.com/markets/data.php?page=quote&sym=ng&x=13&y=8
The numbers are updated every 10 minutes, so this website is dynamic. I used the code below, but it prints everything in a linear format rather than as tabular rows and columns. Included below are two sections of sample output:
Contract
Last
Change
Open
High
Low
Volume
Prev. Stl.
Time
Links
May '21 (NGK21)
2.550s
+0.006
2.550
2.550
2.550
1
2.544
05/21/18
Q / C / O
Jun '21 (NGM21)
2.576s
+0.006
0.000
2.576
2.576
0
2.570
05/21/18
Q / C / O
Code below:
import time
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome(executable_path=r"C:\Users\siddk\PycharmProjects\WebSraping\venv\selenium\webdriver\chromedriver.exe")
browser.get("https://www.insidefutures.com/markets/data.php?page=quote&sym=ng&x=14&y=16")

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
th_tags = soup.find_all('tr')
for th in th_tags:
    print(th.get_text())
I want to extract this data into pandas and analyze averages etc. on a daily basis. Please help; I have exhausted my efforts doing this myself with multiple iterations of the code.
Try the script below to get the tabular data. The key is to find the right url, one which contains the same table but is not generated dynamically, so that you can do your operation without any browser simulator.
Give it a go:
from bs4 import BeautifulSoup
import requests
url = "https://shared.websol.barchart.com/quotes/quote.php?page=quote&sym=ng&x=13&y=8&domain=if&display_ice=1&enabled_ice_exchanges=&tz=0&ed=0"
res = requests.get(url)
soup = BeautifulSoup(res.text,"lxml")
for tr in soup.find(class_="bcQuoteTable").find_all("tr"):
    data = [item.get_text(strip=True) for item in tr.find_all(["th", "td"])]
    print(data)
Results are like:
['Contract', 'Last', 'Change', 'Open', 'High', 'Low', 'Volume', 'Prev. Stl.', 'Time', 'Links']
['Cash (NGY00)', '2.770s', '+0.010', '0.000', '2.770', '2.770', '0', '2.760', '05/21/18', 'Q/C/O']
["Jun \\'18 (NGM18)", '2.901', '-0.007', '2.902', '2.903', '2.899', '138', '2.908', '17:11', 'Q/C/O']
["Jul \\'18 (NGN18)", '2.927', '-0.009', '2.928', '2.930', '2.926', '91', '2.936', '17:11', 'Q/C/O']
["Aug \\'18 (NGQ18)", '2.944', '-0.008', '2.945', '2.947', '2.944', '42', '2.952', '17:10', 'Q/C/O']
