How to web scrape a nested table from HTML with Python Selenium - python

I want to scrape a nested web table with Python Selenium. The table has 4 columns x 10 rows. In each row, the 4th column has an inner cell containing 6 spans, each holding an image.
My problem is that I can only scrape the first 3 columns; I cannot print the 4th column's 6 image src values in the correct row order.
row = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div')
column = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div[1]/div')
column_4th = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div[4]')
innercell_column_4th = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div[4]/span[1]/img')
span_1 = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div/span[1]/img')
span_2 = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div/span[2]/img')
for new_span_1 in span_1:
    span_1_img = new_span_1.get_attribute('src')
for new_span_2 in span_2:
    span_2_img = new_span_2.get_attribute('src')
for new_row in row:
    print(new_row.text, span_1_img, span_2_img)

I would recommend using Selenium together with BeautifulSoup. When you construct the BeautifulSoup object, pass it the page source from Selenium (driver.page_source) instead of a response from the requests module, which cannot execute JavaScript.
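As a minimal sketch of that idea (assuming a driver already loaded on the page, and the div structure implied by the question's XPaths; the URL is a placeholder, and none of this is verified against the real site), you could walk the rows and collect the image src values per row:

from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
driver.get('https://example.com/page-with-the-table')  # placeholder URL; the question does not include one

# feed the rendered HTML (JavaScript already executed) to BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'lxml')

# the structure below mirrors the question's XPaths and is an assumption
table = soup.find(id='resultMainTable')
for row in table.div.find_all('div', recursive=False):        # one div per row
    cells = row.find_all('div', recursive=False)               # one div per column
    texts = [c.get_text(strip=True) for c in cells[:3]]        # first 3 columns
    imgs = [img['src'] for img in cells[3].find_all('img')]    # the 6 images in column 4
    print(texts, imgs)

Because the images are collected inside the row loop, they stay attached to the row they belong to, which is what the question's approach loses.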

Related

Scraping OSHA website using BeautifulSoup

I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output as .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:
(1) Scraping the web site:
I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015. My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e. I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
As you can see from the URL as well as the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df.
EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
1285328.015 12/28/2017 referral 12/28/2017 06/21/2018 2
Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:
# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
case_status_template = html_soup.head.find('div', {"id" : "maincontain"},
class_ = "container").div.find('table', class_ = "table-bordered").find('strong').text
# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:
for table in html_soup.find_all('table', class_= "table-bordered"):
    print(table.text)
# This gave me the output I needed (i.e. the Case Status for all five records on the page),
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.
# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.
insp_details_template = html_soup.head.find('div', {"id" : "maincontain"},
class_ = "container").div.find('table', class_ = "table-unbordered")
for div in html_soup.find_all('table', class_ = "table-unbordered"):
    print(div.text)
# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:
url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
first_table = html_soup.find('table', class_ = "table-borderedu")
first_table_rows = first_table.find_all('tr')
for tr in first_table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
# Then, actually using pandas to get the data into a df and out as a .csv.
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df in dfs_osha:
    print(df)
path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df[1,3] in dfs_osha:
    df.to_csv(os.path.join(path,r'osha_output_table1_012320.csv'))
# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.
So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f. I was trying to work through it, and had gotten as far as coming up with this code:
for elem in all_content_raw_lxml:
    wrappers = elem.find_all('div', class_ = "row-fluid")
    for x in wrappers:
        case_status = x.find('div', class_ = "text-center")
        print(case_status)
        insp_details = x.find('div', class_ = "table-responsive")
        for tr in insp_details:
            td = tr.find_all('td')
            td_row = [i.text for i in td]
            print(td_row)
        violation_items = insp_details.find_next_sibling('div', class_ = "table-responsive")
        for tr in violation_items:
            tr = tr.find_all('tr')
            tr_row = [i.text for i in tr]
            print(tr_row)
        print('---------------')
Unfortunately, I ran into too many bugs with this to be able to use it so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.
For this type of page you don't really need beautifulsoup; pandas is enough.
import pandas as pd

url = 'your url above'

# use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)

# Select from this list of tables only those tables you need:
incident = []  # initialize a list of inspections
for i, table in enumerate(tables):  # we need the index position of this table in the list; more below
    if table.shape[1] == 5:  # all relevant tables have this shape
        case = []  # initialize a list of the inspection items you are interested in
        case.append(table.iat[1,0])  # this is the location in the table of this particular item
        case.append(table.iat[1,2].split(' ')[2])  # the string in the cell needs to be cleaned up a bit...
        case.append(table.iat[9,1])
        case.append(table.iat[12,3])
        case.append(table.iat[13,3])
        case.append(tables[i+2].iat[0,1])  # this particular item is in a table 2 positions down from the current one; this is where the index position of the current table comes in handy
        incident.append(case)

columns = ["inspection_id", "open_date", "inspection_type", "close_conference", "close_case", "violations_serious_initial"]
df2 = pd.DataFrame(incident, columns=columns)
df2
Output (pardon the formatting):
inspection_id open_date inspection_type close_conference close_case violations_serious_initial
0 Nr: 1285328.015 12/28/2017 Referral 12/28/2017 06/21/2018 2
1 Nr: 1283809.015 12/18/2017 Complaint 12/18/2017 05/24/2018 5
2 Nr: 1284178.015 12/18/2017 Accident 05/17/2018 09/17/2018 1
3 Nr: 1283549.015 12/13/2017 Referral 12/13/2017 05/22/2018 3
4 Nr: 1282631.015 12/12/2017 Fat/Cat 12/12/2017 11/16/2018 1

BS4 web scraping to CSV file, think I am grabbing too many rows ('tr's)

My web scrape code grabs more rows of data than I need. I would like to grab only the rows per player; it looks like these "tr" elements all include:
<tr class="diff-row evTabRow bc"
Also, the td data that I want to grab is the:
data-odig=
attribute, from the list of table data below:
<td class="bc bs o" data-bk="B3" data-odig="9" data-o="8" data-hcap="" data-fodds="9.0" data-ew-denom="4" data-ew-places="5" xpath="1"><p>9</p></td>
The code is picking up the
data-o=
attribute instead, which is problematic for me as it is sometimes expressed as a fraction.
Any advice is appreciated.
I am new to coding; Python is my first try.
My code has been written mainly from what I have picked up on YouTube and copied from others, adapted to fit my needs. I have tried to edit it to be specific about the type of table rows and data to include, but just cannot find an answer that works (numerous syntax errors). I also suspect I have a line or two that is not doing anything.
import requests
from bs4 import BeautifulSoup

# "header" is a request-headers dict defined earlier in the script (not shown)
url = 'https://www.oddschecker.com/golf/the-masters/2020-us-masters/winner'
r = requests.get(url, headers=header)
soup = BeautifulSoup(r.text, 'lxml')
table = soup.findAll("table")[1]
rows_list = []
for rows in table.findAll('tr'):
    cell_list = []
    for cell in rows.findAll('td'):
        text = cell.text
        cell_list.append(text)
    rows_list.append(cell_list)
find() and findAll()/find_all() can take other arguments to filter the results:
findAll('tr', {'class': 'diff-row evTabRow bc'})
or
findAll('tr', class_='diff-row evTabRow bc')
You can use True if the attribute has to exist but may have different values:
findAll('td', {'data-o': True})
See more in the BeautifulSoup documentation.
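Putting those filters together, a rough sketch for this page might look like the following (assuming the class names and the data-odig attribute shown in the question; the User-Agent header is only an example, since the question's header dict isn't shown):

import requests
from bs4 import BeautifulSoup

url = 'https://www.oddschecker.com/golf/the-masters/2020-us-masters/winner'
headers = {'User-Agent': 'Mozilla/5.0'}  # example value; the question passes its own headers
soup = BeautifulSoup(requests.get(url, headers=headers).text, 'lxml')

rows_list = []
for row in soup.find_all('tr', class_='diff-row evTabRow bc'):
    # keep only cells that actually carry the decimal odds attribute
    odds = [td['data-odig'] for td in row.find_all('td', {'data-odig': True})]
    rows_list.append(odds)
print(rows_list[:3])

Filtering on data-odig directly avoids picking up the data-o cells, which are the ones sometimes expressed as fractions.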

Relative XPath Wrongly Selects Same Element in Loop

I'm scraping some data.
One of the data points I need is the date, but the table cells containing this data only include months and days. Luckily the year is used as a headline element to categorize the tables.
For some reason year = table.find_element(...) is selecting the same element for every iteration.
I would expect year = table.find_element(...) to select unique elements relative to each unique table element as it loops through all of them, but this isn't the case.
Actual Output
# random, hypothetical values
Page #1
element="921"
element="921"
element="921"
...
Page #2
element="1283"
element="1283"
element="1283"
...
Expected Output
# random, hypothetical values
Page #1
element="921"
element="922"
element="923"
...
Page #2
element="1283"
element="1284"
element="1285"
...
How come the following code selects the same element for every iteration on each page?
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
links_sc2 = [
    'https://liquipedia.net/starcraft2/Premier_Tournaments',
    'https://liquipedia.net/starcraft2/Major_Tournaments',
    'https://liquipedia.net/starcraft2/Minor_Tournaments',
    'https://liquipedia.net/starcraft2/Minor_Tournaments/HotS',
    'https://liquipedia.net/starcraft2/Minor_Tournaments/WoL'
]
ff = webdriver.Firefox(executable_path=r'C:\\WebDriver\\geckodriver.exe')
urls = []
for link in links_sc2:
    ff.get(link)  # load each page before querying its tables
    tables = ff.find_elements(By.XPATH, '//h2/following::table')
    for table in tables:
        try:
            # premier, major
            year = table.find_element(By.XPATH, './preceding-sibling::h3/span').text
        except:
            # minor
            year = table.find_element(By.XPATH, './preceding-sibling::h2/span').text
        print(year)
ff.quit()
You need to use ./preceding-sibling::h3[1]/span to get the nearest h3 sibling of the context element (your table).
The preceding-sibling axis works like this:
./preceding-sibling::h3 will return the first h3 sibling in DOM order, which is the year 2019 in your case.
But if you use indexing, ./preceding-sibling::h3[1] will return the h3 element nearest to the context element, and further indexing walks to the next match in reverse DOM order. You can also use ./preceding-sibling::h3[last()] to get the farthest sibling.
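Applied to the loop in the question, the lookup becomes something like this (a sketch; the same [1] index is assumed for the h2 fallback on the minor-tournament pages):

from selenium.webdriver.common.by import By

# "tables" is the list built in the question's code above
for table in tables:
    try:
        # nearest preceding h3 (premier / major pages)
        year = table.find_element(By.XPATH, './preceding-sibling::h3[1]/span').text
    except Exception:
        # nearest preceding h2 (minor pages)
        year = table.find_element(By.XPATH, './preceding-sibling::h2[1]/span').text
    print(year)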

How to grab all tr elements from a table and click on a link?

I am trying to figure out how to print all tr elements from a table, but I can't quite get it working right.
Here is the link I am working with.
https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate
Here is my code.
import requests
from bs4 import BeautifulSoup
link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"
html = requests.get(link).text
# If you do not want to use requests then you can use the following code below
# with urllib (the snippet above). It should not cause any issue.
soup = BeautifulSoup(html, "lxml")
res = soup.findAll("span", {"class": "fn"})
for r in res:
    print("Name: " + r.find('a').text)
table_body = soup.find('senators')
rows = table_body.find_all('tr')
for row in rows:
    cols = row.find_all('td')
    cols = [x.text.strip() for x in cols]
    print(cols)
I am trying to print all tr elements from the table named 'senators'. Also, I am wondering if there is a way to click on the senators' links, like 'Richard Shelby', which takes me to this:
https://en.wikipedia.org/wiki/Richard_Shelby
From each link, I want to grab the data under 'Assumed office'. In this case the value is: 'January 3, 2018'. So, ultimately, I want to end up with this:
Richard Shelby May 6, 1934 (age 84) Lawyer U.S. House
Alabama Senate January 3, 1987 2022
Assumed office: January 3, 2018
All I can get now is the name of each senator printed out.
In order to locate the "Senators" table, you can first find the corresponding "Senators" label and then get the first following table element:
soup.find(id='Senators').find_next("table")
Now, in order to get the data row by row, you would have to account for the cells with a "rowspan" which stretch across multiple rows. You can either follow the approaches suggested at What should I do when <tr> has rowspan, or the implementation I provide below (not ideal but works in your case).
import copy
import requests
from bs4 import BeautifulSoup

link = "https://en.wikipedia.org/wiki/List_of_current_members_of_the_United_States_Senate"

with requests.Session() as session:
    html = session.get(link).text
    soup = BeautifulSoup(html, "lxml")

    senators_table = soup.find(id='Senators').find_next("table")
    headers = [td.get_text(strip=True) for td in senators_table.tr('th')]
    rows = senators_table.find_all('tr')

    # pre-process table to account for rowspan, TODO: extract into a function
    for row_index, tr in enumerate(rows):
        for cell_index, td in enumerate(tr('td')):
            if 'rowspan' in td.attrs:
                rowspan = int(td['rowspan'])
                del td.attrs['rowspan']

                # insert the same td into subsequent rows
                for index in range(row_index + 1, row_index + rowspan):
                    try:
                        rows[index]('td')[cell_index].insert_after(copy.copy(td))
                    except IndexError:
                        continue

    # extracting the desired data
    rows = senators_table.find_all('tr')[1:]
    for row in rows:
        cells = [td.get_text(strip=True) for td in row('td')]
        print(dict(zip(headers, cells)))
If you then want to follow the links to the senator "profile" pages, you would first need to extract the link from the appropriate cell in a row and then use session.get() to "navigate" to it, something along these lines:
senator_link = row.find_all('td')[3].a['href']
senator_link = urljoin(link, senator_link)
response = session.get(senator_link)
soup = BeautifulSoup(response.content, "lxml")
# TODO: parse
where urljoin is imported as:
from urllib.parse import urljoin
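To then pull the "Assumed office" value the question asks for, a minimal sketch of that TODO might search the profile page for the label text and read the surrounding infobox cell (the exact Wikipedia infobox markup is an assumption, not verified):

import re

# "soup" is the BeautifulSoup object built from the senator page above
label = soup.find(string=re.compile("Assumed office"))
if label:
    # assumption: the date sits in the same table cell as the label text
    cell = label.find_parent(["td", "th"])
    print(cell.get_text(" ", strip=True))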
Also, FYI, one of the reasons to use requests.Session() here is to optimize making requests to the same host:
The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3’s connection pooling. So if you’re making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
There is also another way to get the tabular data parsed: .read_html() from pandas. You could do:
import pandas as pd
df = pd.read_html(str(senators_table))[0]
print(df.head())
to get the desired table as a dataframe.

Extract all data from a dynamic HTML table

Here is my issue:
For an Excel-writing app, I'm extracting data from an HTML table.
I have a website which contains the table, and I can go through it and extract data.
BUT
since the table shows only 20 rows at a time, I can only extract the first 20 rows and not the whole table (whose total row count is pretty random).
Note that the HTML table resets its td IDs to row0 through row19 each time you scroll down (probably usual, but I'm not an HTML pro :D).
I have no idea how I could go through the whole table without duplicate row data.
If anyone has an idea, you're welcome!
Edit 1:
Here is the HTML (I've filtered it to keep only col1, which is all I need for my extract):
`https://jsfiddle.net/yfb429Lo/13/`
Indeed, there is a scroll bar on the right of the table, as in the screenshot here:
Table_screenshot
When I scroll down twice through the table, the HTML updates itself to become like this:
==> row2 becomes row0, row3 becomes row1, ...
I have something like 100 tables to extract and I can't know a table's length in advance.
Thanks all,
Arnaud
Extract the rows using XPath instead of td IDs, since those are not constant.
Click the next-page button, then extract the rows again, repeating until the next-page button click gives you a NotFoundException (this depends on the button not being present on the last page); a rough sketch of this pattern is shown below. If you provide the HTML or the website link you will get a better answer.
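A minimal sketch of that pattern (the next-page locator and the cell XPath are placeholder assumptions, since the full markup isn't shown):

from selenium.common.exceptions import NoSuchElementException

# placeholder locators; substitute the real ones for your table
NEXT_PAGE_XPATH = "//button[contains(@aria-label, 'Next')]"
CELL_XPATH = ".//tr/*[contains(@id, '-col1')]"

# "driver" is an already-running Selenium WebDriver pointed at the table page
all_values = []
while True:
    # read the rows that are currently rendered
    for td in driver.find_elements_by_xpath(CELL_XPATH):
        all_values.append(td.text)
    try:
        driver.find_element_by_xpath(NEXT_PAGE_XPATH).click()
    except NoSuchElementException:
        break  # no next-page button left, so we are on the last page
print(all_values)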
After a lot of testing, here is the answer:
import time
import selenium.common.exceptions
from selenium.webdriver.common.keys import Keys

# driver, wb and catalog are defined elsewhere in the original script
try:
    last_row = driver.find_element_by_xpath(".//tr/*[contains(@id, '--TilesTable-rows-row19-col1')]")
    last_row_old = driver.find_element_by_xpath(".//tr/*[contains(@id, '--TilesTable-rows-row19-col1')]").text
    last_row.click()
    last_row.send_keys(Keys.PAGE_DOWN)
    time.sleep(2)
    last_row_new = driver.find_element_by_xpath(".//tr/*[contains(@id, '--TilesTable-rows-row19-col1')]").text
    while last_row_new != last_row_old:
        table = driver.find_element_by_xpath("//*[contains(@id, '--TilesTable-table')]/tbody")
        td_list = table.find_elements_by_xpath(".//tr/*[contains(@id, '-col1')]")
        for td in td_list:
            tile_title = td.text
            sh_tile = wb["Tuiles"]
            sh_tile.append([catalog, tile_title])
        last_row = driver.find_element_by_xpath(".//tr/*[contains(@id, '--TilesTable-rows-row19-col1')]")
        last_row_old = driver.find_element_by_xpath(".//tr/*[contains(@id, '--TilesTable-rows-row19-col1')]").text
        last_row.click()
        last_row.send_keys(Keys.PAGE_DOWN)
        time.sleep(0.5)
        last_row_new = driver.find_element_by_xpath(".//tr/*[contains(@id, '--TilesTable-rows-row19-col1')]").text
except selenium.common.exceptions.NoSuchElementException:
    pass
