Relative XPath Wrongly Selects Same Element in Loop - python

I'm scraping some data.
One of the data points I need is the date, but the table cells containing this data only include months and days. Luckily the year is used as a headline element to categorize the tables.
For some reason year = table.find_element(...) is selecting the same element for every iteration.
I would expect year = table.find_element(...) to select unique elements relative to each unique table element as it loops through all of them, but this isn't the case.
Actual Output
# random, hypothetical values
Page #1
element="921"
element="921"
element="921"
...
Page #2
element="1283"
element="1283"
element="1283"
...
Expected Output
# random, hypothetical values
Page #1
element="921"
element="922"
element="923"
...
Page #2
element="1283"
element="1284"
element="1285"
...
How come the following code selects the same element for every iteration on each page?
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver import Firefox
from selenium.webdriver.common.by import By
links_sc2 = [
    'https://liquipedia.net/starcraft2/Premier_Tournaments',
    'https://liquipedia.net/starcraft2/Major_Tournaments',
    'https://liquipedia.net/starcraft2/Minor_Tournaments',
    'https://liquipedia.net/starcraft2/Minor_Tournaments/HotS',
    'https://liquipedia.net/starcraft2/Minor_Tournaments/WoL'
]
ff = webdriver.Firefox(executable_path=r'C:\\WebDriver\\geckodriver.exe')
urls = []
for link in links_sc2:
    # load the page before querying it
    ff.get(link)
    tables = ff.find_elements(By.XPATH, '//h2/following::table')
    for table in tables:
        try:
            # premier, major
            year = table.find_element(By.XPATH, './preceding-sibling::h3/span').text
        except:
            # minor
            year = table.find_element(By.XPATH, './preceding-sibling::h2/span').text
        print(year)
ff.quit()

You need to use ./preceding-sibling::h3[1]/span to get the nearest h3 sibling relative to the context element (your table).
The preceding-sibling axis works like this:
./preceding-sibling::h3 matches every preceding h3 sibling, and find_element returns the first match in DOM order, which is year 2019 for you.
With an index, ./preceding-sibling::h3[1] returns the h3 nearest to the context element, and higher indexes reach the next matches in reverse DOM order. You can also use ./preceding-sibling::h3[last()] to get the farthest sibling.
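The difference can be checked offline with lxml, which evaluates the same XPath 1.0 expressions. This is only a minimal sketch on made-up markup (the headings and table ids are hypothetical, not Liquipedia's real structure):

```python
from lxml import html

# Hypothetical markup: each year heading is followed by its table.
doc = html.fromstring("""
<div>
  <h3><span>2019</span></h3><table id="t1"></table>
  <h3><span>2018</span></h3><table id="t2"></table>
  <h3><span>2017</span></h3><table id="t3"></table>
</div>
""")

first_in_dom_order = []
nearest = []
for table in doc.xpath("//table"):
    # Without an index, every preceding h3 sibling matches; results come back
    # in document order, so the first one is always the 2019 heading.
    first_in_dom_order.append(table.xpath("./preceding-sibling::h3/span/text()")[0])
    # On a reverse axis, [1] counts backwards from the context node.
    nearest.append(table.xpath("./preceding-sibling::h3[1]/span/text()")[0])

print(first_in_dom_order)  # ['2019', '2019', '2019']
print(nearest)             # ['2019', '2018', '2017']
```

Selenium's find_element behaves like taking index 0 of the first list, which is why every table reported the same year.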

Related

Dealing with missing value scraping with bs4

This is my script to scrape odds from a particular website (it should also work outside my country; I don't think there are restrictions yet):
from selenium import webdriver
from time import sleep
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
odds=[]
home=[]
away=[]
url = "https://www.efbet.it/scommesse/calcio/serie-c_1_31_-418"
driver = webdriver.Chrome(r"C:\chromedriver.exe")
driver.get(url)
sleep(5)
#driver.maximize_window()
#driver.find_element_by_id('onetrust-accept-btn-handler').click()
soup = BeautifulSoup(driver.page_source, "html.parser")
id = soup.find(class_="contenitore-table-grande")
for a in id.select("p[class*='tipoQuotazione_1']"):
    odds.append(a.text)
for a in id.select("p[class*='font-weight-bold m-0 text-right']"):
    home.append(a.text)
for a in id.select("p[class*='font-weight-bold m-0 text-left']"):
    away.append(a.text)
a=np.asarray(odds)
newa= a.reshape(42,10)
df = pd.DataFrame(newa)
df1 = pd.DataFrame(home)
df2 = pd.DataFrame(away)
dftot = pd.concat([df1, df2, df], axis=1)
Now it works fine (I'm aware it could be written in a better and cleaner way), but there's an issue: when the website publishes new odds, some types are sometimes missing (e.g. under/over or double chance 1X, 12, X2). So I need to put a zero or null value where they are missing; otherwise my array no longer matches in length, and the odds no longer line up with their respective matches.
When I inspect the page, I see that a missing value simply has no text in the tipoQuotazione element:
<p class="tipoQuotazione_1">1.75</p> with value
<p class="tipoQuotazione_1"></p> when missing
Is there a way to perform this?
Thanks!
... when new odds are published by the website, sometimes some kind of them are missing ...
As a broader design point, this is not the only problem you might run into. What if the website changes a class name? That would break your code as well.
... sometimes some kind of them are missing (i.e. under over or double chance 1X 12 X2). So i would need to put a zero or null value where they are missing ...
You can fall back to a default when the text is empty:
for a in id.select("p[class*='tipoQuotazione_1']"):
    # if a.text == "" default to 0.0
    odds.append(a.text or 0.0)
Or you can do it with an if statement:
for a in id.select("p[class*='tipoQuotazione_1']"):
    if not a.text:
        odds.append(0.0)
    else:
        odds.append(a.text)
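To keep the odds aligned with their matches, the empty cells can be normalized before reshaping. The following is a minimal sketch with made-up values; normalize_odds and chunk are hypothetical helper names, and the row size of 3 is only for illustration (the question uses 10 odds per match):

```python
def normalize_odds(texts, missing=None):
    """Convert scraped odds strings to floats, using `missing` for empty cells."""
    cleaned = []
    for text in texts:
        text = text.strip()
        cleaned.append(float(text) if text else missing)
    return cleaned


def chunk(values, size):
    """Split a flat list into consecutive rows of `size` items."""
    return [values[i:i + size] for i in range(0, len(values), size)]


# Stand-in for [a.text for a in id.select(...)]: two matches, three odds each.
scraped = ["1.75", "", "3.40", "2.10", "3.05", ""]
rows = chunk(normalize_odds(scraped), 3)
print(rows)  # [[1.75, None, 3.4], [2.1, 3.05, None]]
```

Because every p element is still counted, even when empty, the total stays a multiple of the row size and each row keeps belonging to the same match.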

Using Beautiful Soup to pull dates from table

I'm looking to do something with bills that have been delivered to the governor - collecting dates for when they were delivered and the date of the last legislative action before they were sent.
I'm doing this for a whole series of similar URLs. The problem is that my code (below) works for some URLs and not others. I'm writing the results to a pandas dataframe and then to a csv file. When the code fails, it runs the else block when either the if or the elif should have been triggered.
Here's a fail URL: https://www.nysenate.gov/legislation/bills/2011/s663
And a succeed URL: https://www.nysenate.gov/legislation/bills/2011/s333
Take the first URL for example. Underneath the "view actions" dropdown, it says it was delivered to the governor on Jul 29, 2011. Prior to that, it was returned to assembly on Jun 20, 2011.
Using the position of the "delivered to governor" td in the table, I'd like to collect both dates with bs4.
Here's what I have in my code:
check_list = [item.text.strip() for item in tablebody.select("td")]
dtg = "delivered to governor"
dtg_regex = re.compile(
    '/.*(\S\S\S\S\S\S\S\S\S\s\S\S\s\S\S\S\S\S\S\S\S).*'
)
if dtg in check_list:
    i = check_list.index(dtg)
    transfer_list.append(check_list[i+1]) ## last legislative action date (not counting dtg)
    transfer_list.append(check_list[i-1]) ## dtg date
elif any(dtg_regex.match(dtg_check_list) for dtg_check_list in check_list):
    transfer_list.append(check_list[4])
    transfer_list.append(check_list[2])
else:
    transfer_list.append("no floor vote")
    transfer_list.append("not delivered to governor")
You could use :has and :contains to target the right first row, and find_next to move to the next row. Use last-of-type to get the last action in the first row, and select_one to get the first action in the second row. The class of each "column" lets you move between the first and second columns.
Your mileage may vary with other pages.
import requests
from bs4 import BeautifulSoup as bs
links = ['https://www.nysenate.gov/legislation/bills/2011/s663', 'https://www.nysenate.gov/legislation/bills/2011/s333']
transfer_list = []
with requests.Session() as s:
    for link in links:
        r = s.get(link)
        soup = bs(r.content, 'lxml')
        target = soup.select_one('.cbill--actions-table--row:has(td:contains("delivered"))')
        if target:
            print(target.select_one('.c-bill--actions-table-col1').text)
            # transfer_list.append(target.select_one('.c-bill--actions-table-col1').text)
            print(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            # transfer_list.append(target.select_one('.c-bill--action-line-assembly:last-of-type, .c-bill--action-line-senate:last-of-type').text)
            print(target.find_next('tr').select_one('.c-bill--actions-table-col1').text)
            # append again
            print(target.find_next('tr').select_one('.c-bill--actions-table-col2 span').text)
            # append again
        else:
            transfer_list.append("no floor vote")
            transfer_list.append("not delivered to governor")
Make full use of XPath:
Get date of "delivered to governor"
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/td[1]/text()
S663A - http://xpather.com/bUQ6Gva8
S333 - http://xpather.com/oTNfuH75
Get date of "returned to assembly/senate"
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/following-sibling::tr/td//text()[contains(lower-case(.), 'returned to')]/ancestor::tr/td[1]/text()
S663A - http://xpather.com/Rnufm2TB
S333 - http://xpather.com/4x9UHn4L
Get date of action which precedes "delivered to governor" row regardless of what the action is
//text()[contains(lower-case(.), 'delivered to governor')]/ancestor::tr/following-sibling::tr[1]/td/text()
S663A - http://xpather.com/AUpWCFIz
S333 - http://xpather.com/u8LOCb0x
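Note that lower-case() is an XPath 2.0 function. Selenium (which delegates XPath to the browser) and lxml implement XPath 1.0, where the usual case-insensitive workaround is translate(). Below is a minimal sketch on made-up rows, not the real nysenate.gov markup, assuming the newest action is listed first as in the expressions above:

```python
from lxml import html

# Hypothetical action rows, newest first.
doc = html.fromstring("""
<table>
  <tr><td>Jul 29, 2011</td><td>Delivered To Governor</td></tr>
  <tr><td>Jun 20, 2011</td><td>returned to assembly</td></tr>
  <tr><td>Jun 15, 2011</td><td>passed senate</td></tr>
</table>
""")

UPPER = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
LOWER = "abcdefghijklmnopqrstuvwxyz"

# XPath 1.0 stand-in for lower-case(.): map each uppercase letter to lowercase.
row = f"//tr[td[contains(translate(., '{UPPER}', '{LOWER}'), 'delivered to governor')]]"

delivered_date = doc.xpath(f"{row}/td[1]/text()")[0]
# The row right after the match holds the action that preceded delivery.
previous_action_date = doc.xpath(f"{row}/following-sibling::tr[1]/td[1]/text()")[0]

print(delivered_date)        # Jul 29, 2011
print(previous_action_date)  # Jun 20, 2011
```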

how to web scrape a nested table from html with python selenium

I want to scrape a nested web table with Python Selenium. The table has 4 columns x 10 rows. The 4th column has an inner cell containing 6 spans, which store 6 images in each row.
My problem is that I can only scrape the first 3 columns; I cannot show the 4th column's data, with its 6 image src values, in the correct row order.
row = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div')
column = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div[1]/div')
column_4th = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div[4]')
innercell_column_4th = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div[4]/span[1]/img')
span_1 = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div/span[1]/img')
span_2 = mstable.find_elements_by_xpath('//*[@id="resultMainTable"]/div/div/div/span[2]/img')
for new_span_1 in span_1:
    span_1_img = (new_span_1.get_attribute('src'))
for new_span_2 in span_2:
    span_2_img = (new_span_2.get_attribute('src'))
for new_row in row:
    print ((new_row.text), (span_1_img), (span_2_img))
I would recommend using Selenium together with BeautifulSoup. When BeautifulSoup asks for the page source, pass it Selenium's driver.page_source instead of fetching the page with the requests module, which cannot execute JavaScript.
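As a rough sketch of that idea, the page source can be parsed once and walked row by row, so each row's images stay attached to that row. The markup below is a hypothetical stand-in for driver.page_source (a guess at the layout described in the question, with two images per row instead of six):

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; the structure is assumed, not taken from the real page.
page_source = """
<div id="resultMainTable"><div>
  <div><div>A1</div><div>B1</div><div>C1</div>
       <div><span><img src="a.png"></span><span><img src="b.png"></span></div></div>
  <div><div>A2</div><div>B2</div><div>C2</div>
       <div><span><img src="c.png"></span><span><img src="d.png"></span></div></div>
</div></div>
"""

soup = BeautifulSoup(page_source, "html.parser")
results = []
# Each direct child div of the inner wrapper is one table row.
for row in soup.select("#resultMainTable > div > div"):
    cells = row.find_all("div", recursive=False)
    texts = [cell.get_text(strip=True) for cell in cells[:3]]
    # Look for images inside this row's last cell only, so they keep the row order.
    images = [img["src"] for img in cells[-1].select("span > img")]
    results.append(texts + images)

for line in results:
    print(line)
```

The key design choice is to search for the images relative to each row rather than across the whole document, which is what lost the row order in the original loops.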

Scraping OSHA website using BeautifulSoup

I'm looking for help with two main things: (1) scraping a web page and (2) turning the scraped data into a pandas dataframe (mostly so I can output as .csv, but just creating a pandas df is enough for now). Here is what I have done so far for both:
(1) Scraping the web site:
I am trying to scrape this page: https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015&id=1284178.015&id=1283809.015&id=1283549.015&id=1282631.015. My end goal is to create a dataframe that would ideally contain only the information I am looking for (i.e. I'd be able to select only the parts of the site that I am interested in for my df); it's OK if I have to pull in all the data for now.
As you can see from the URL as well as the ID hyperlinks underneath "Quick Link Reference" at the top of the page, there are five distinct records on this page. I would like each of these IDs/records to be treated as an individual row in my pandas df.
EDIT: Thanks to a helpful comment, I'm including an example of what I would ultimately want in the table below. The first row represents column headers/names and the second row represents the first inspection.
inspection_id  open_date   inspection_type  close_conference  close_case  violations_serious_initial
1285328.015    12/28/2017  referral         12/28/2017        06/21/2018  2
Mostly relying on BeautifulSoup4, I've tried a few different options to get at the page elements I'm interested in:
# This is meant to give you the first instance of Case Status, which in the case of this page is "CLOSED".
case_status_template = html_soup.head.find('div', {"id" : "maincontain"},
    class_ = "container").div.find('table', class_ = "table-bordered").find('strong').text
# I wasn't able to get the remaining Case Statuses with find_next_sibling or find_all, so I used a different method:
for table in html_soup.find_all('table', class_= "table-bordered"):
    print(table.text)
# This gave me the output I needed (i.e. the Case Status for all five records on the page),
# but didn't give me the structure I wanted and didn't really allow me to connect to the other data on the page.
# I was also able to get to the same place with another page element, Inspection Details.
# This is the information reflected on the page after "Inspection: ", directly below Case Status.
insp_details_template = html_soup.head.find('div', {"id" : "maincontain"},
    class_ = "container").div.find('table', class_ = "table-unbordered")
for div in html_soup.find_all('table', class_ = "table-unbordered"):
    print(div.text)
# Unfortunately, although I could get these two pieces of information to print,
# I realized I would have a hard time getting the rest of the information for each record.
# I also knew that it would be hard to connect/roll all of these up at the record level.
So, I tried a slightly different approach. By focusing instead on a version of that page with a single inspection record, I thought maybe I could just hack it by using this bit of code:
url = 'https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
first_table = html_soup.find('table', class_ = "table-borderedu")
first_table_rows = first_table.find_all('tr')
for tr in first_table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
# Then, actually using pandas to get the data into a df and out as a .csv.
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df in dfs_osha:
    print(df)
path = r'~\foo'
dfs_osha = pd.read_html('https://www.osha.gov/pls/imis/establishment.inspection_detail?id=1285328.015',header=1)
for df[1,3] in dfs_osha:
    df.to_csv(os.path.join(path,r'osha_output_table1_012320.csv'))
# This worked better, but didn't actually give me all of the data on the page,
# and wouldn't be replicable for the other four inspection records I'm interested in.
So, finally, I found a pretty handy example here: https://levelup.gitconnected.com/quick-web-scraping-with-python-beautiful-soup-4dde18468f1f. I was trying to work through it, and had gotten as far as coming up with this code:
for elem in all_content_raw_lxml:
    wrappers = elem.find_all('div', class_ = "row-fluid")
    for x in wrappers:
        case_status = x.find('div', class_ = "text-center")
        print(case_status)
        insp_details = x.find('div', class_ = "table-responsive")
        for tr in insp_details:
            td = tr.find_all('td')
            td_row = [i.text for i in td]
            print(td_row)
        violation_items = insp_details.find_next_sibling('div', class_ = "table-responsive")
        for tr in violation_items:
            tr = tr.find_all('tr')
            tr_row = [i.text for i in tr]
            print(tr_row)
        print('---------------')
Unfortunately, I ran into too many bugs with this to be able to use it so I was forced to abandon the project until I got some further guidance. Hopefully the code I've shared so far at least shows the effort I've put in, even if it doesn't do much to get to the final output! Thanks.
For this type of page you don't really need beautifulsoup; pandas is enough.
import pandas as pd
url = 'your url above'
# use pandas to read the tables on the page; there are lots of them...
tables = pd.read_html(url)
# select from this list of tables only those tables you need:
incident = []  # initialize a list of inspections
for i, table in enumerate(tables):  # we need the index position of this table in the list; more below
    if table.shape[1]==5:  # all relevant tables have this shape
        case = []  # initialize a list of the inspection items you are interested in
        case.append(table.iat[1,0])  # this is the location in the table of this particular item
        case.append(table.iat[1,2].split(' ')[2])  # the string in the cell needs to be cleaned up a bit...
        case.append(table.iat[9,1])
        case.append(table.iat[12,3])
        case.append(table.iat[13,3])
        case.append(tables[i+2].iat[0,1])  # this item is in a table 2 positions down from the current one; this is where the index position of the current table comes in handy
        incident.append(case)
columns = ["inspection_id", "open_date", "inspection_type", "close_conference", "close_case", "violations_serious_initial"]
df2 = pd.DataFrame(incident,columns=columns)
df2
Output (pardon the formatting):
   inspection_id    open_date   inspection_type  close_conference  close_case  violations_serious_initial
0  Nr: 1285328.015  12/28/2017  Referral         12/28/2017        06/21/2018  2
1  Nr: 1283809.015  12/18/2017  Complaint        12/18/2017        05/24/2018  5
2  Nr: 1284178.015  12/18/2017  Accident         05/17/2018        09/17/2018  1
3  Nr: 1283549.015  12/13/2017  Referral         12/13/2017        05/22/2018  3
4  Nr: 1282631.015  12/12/2017  Fat/Cat          12/12/2017        11/16/2018  1

Iterate through DIV Class using selenium on page

I'm looking to iterate through a set of rows on a page using selenium to scrape live results from the page in a quick manner. I have a code which appears to return the first row and print it, but doesn't look to be iterating through the set.
content = [browser.find_element_by_class_name('event')]
rows = [browser.find_elements_by_class_name('event__match')]
for rows in content:
    goals = {}
    goals['Home'] = rows.find_element_by_class_name("event__participant--home").text.strip()
    goals['Away'] = rows.find_element_by_class_name("event__participant--away").text.strip()
    goals['hScore'] = rows.find_element_by_class_name("event__scores").text.split("-")[1]
    goals['aScore'] = rows.find_element_by_class_name("event__scores").text.split("-")[-1]
    print(goals['Home'],goals['aScore'],goals['aScore'],goals['Away'])
gets me the result;
Team 1
0
0 Team 2
Which would be the expected result when it's only one match on the page - but there's 50 at the moment.
I feel like I'm missing something in my method here, it could be pretty simple and staring me in the face so apologies if that's the case!
Your mistake is in for rows in content:, where content is the parent div and you need the rows. To iterate through all rows, use the code below (it also takes the home score from index 0 and prints hScore instead of printing aScore twice):
rows = browser.find_elements_by_class_name('event__match')
for row in rows:
    goals = {}
    goals['Home'] = row.find_element_by_class_name("event__participant--home").text.strip()
    goals['Away'] = row.find_element_by_class_name("event__participant--away").text.strip()
    # the score text holds both numbers, separated by a dash
    scores = row.find_element_by_class_name("event__scores").text.split("-")
    goals['hScore'] = scores[0].strip()
    goals['aScore'] = scores[-1].strip()
    print(goals['Home'],goals['hScore'],goals['aScore'],goals['Away'])
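When the score text is split on the dash, index 0 holds the home score and index -1 the away score. Here is a small sketch of that step in isolation, assuming the element text looks like "2\n-\n1" (parse_score is a hypothetical helper, not part of Selenium):

```python
def parse_score(text):
    """Split a scraped score such as '2 - 1' into (home, away) strings."""
    parts = [part.strip() for part in text.split("-")]
    if len(parts) != 2 or not all(parts):
        # Match not started yet, or an unexpected format: no usable score.
        return None, None
    return parts[0], parts[-1]


print(parse_score("2\n-\n1"))  # ('2', '1')
print(parse_score("-"))        # (None, None)
```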
