How to find the proper XPath for Selenium? - Python

I'm trying to scrape this page: https://www.bitmex.com/app/trade/XBTUSD
to get the Open Interest data on the left side of the page. I am at this stage:
import bs4
from bs4 import BeautifulSoup
import requests
import re
from selenium import webdriver
import urllib.request
r = requests.get('https://www.bitmex.com/app/trade/XBTUSD')
url = "https://www.bitmex.com/app/trade/XBTUSD"
page = urllib.request.urlopen('https://www.bitmex.com/app/trade/XBTUSD')
soup = bs4.BeautifulSoup(r.text, 'xml')
resultat = soup.find_all(text=re.compile("Open Interest"))
driver = webdriver.Firefox(executable_path='C:\\Users\\Samy\\Desktop\\geckodriver\\geckodriver.exe')
results = driver.find_elements_by_xpath("//*[@class='contractStats hoverContainer block']//*[@class='value']/html/body/div[1]/div/span/div[1]/div/div[2]/li/ul/div/div/div[2]/div[4]/span[2]/span/span[1]")
print(len(results))
I get 0 as a result. I tried several different things for the results variable (also driver.find_elements_by_xpath("//span[@class='price']/text()")), but can't seem to find the right way. I know the problem is in how I copy the XPath, but I can't clearly understand the issue despite reading Why does this xpath fail using lxml in python? and https://stackoverflow.com/a/43095252/7937578
I was using only the XPath obtained by copying, but after reading those SO questions I added the [@class...] part at the beginning. I'm still missing something. Thank you if you know how to help!

If I understood your requirements correctly, the following script should fetch you the required content from that page:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
link = "https://www.bitmex.com/app/trade/XBTUSD"
with webdriver.Firefox() as driver:
    driver.get(link)
    wait = WebDriverWait(driver, 10)
    items = [item.text for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//*[@class='lineItem']/span[@class='hoverHidden'][.//*[contains(.,'Open Interest')]]//span[@class='key' or @class='value']")))]
    print(items)
Output at this moment:
['Open Interest', '640,089,423 USD']

I don't know why it fails, but I think the best way to find any element is by its full XPath.
Something that looks like this:
homebutton = driver.find_element_by_xpath("/html/body/header/div/div[1]/a[2]/span")
Give it a try.

A full path is not the best option, and it's also harder to read. An XPath is a 'filter': try to find some unique attributes for the needed control, or some unique description of its parent. Notice that the needed span has the 'value' class and is located inside a span with the 'tooltipWrapper' class; the parent span also has another child with the 'key' class and the text 'Open Interest'. There are thousands of possible locators; I can suggest two:
//span[@class = 'tooltipWrapper' and span[string() = 'Open Interest']]//span[@class = 'value']
//span[@class = 'key' and text() = 'Open Interest']/..//span[@class = 'value']
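If it helps, here is a minimal sketch of how the first locator could be plugged into the waiting approach from the earlier answer; the class names come from that answer and may change if BitMEX updates its markup:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

with webdriver.Firefox() as driver:
    driver.get("https://www.bitmex.com/app/trade/XBTUSD")
    wait = WebDriverWait(driver, 10)
    # Wait for the span holding the Open Interest value, using the first locator above
    value = wait.until(EC.presence_of_element_located((
        By.XPATH,
        "//span[@class = 'tooltipWrapper' and span[string() = 'Open Interest']]//span[@class = 'value']"
    )))
    print(value.text)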

Related

BeautifulSoup saying tag has no attributes while looking for sibling or parent tags

I'm attempting to extract the tax bill for the following web page (http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051).
The tax bill is the $8,084.54 value directly following the Taxes & Assessments string.
I need to use some static object to go off of because the code will be working over multiple pages.
The "Taxes & Assessments" string is a constant between all pages and always precedes the full tax bill, while the tax bill changes between pages.
My thought was that I could find the "Taxes & Assessments" string, then traverse the BeautifulSoup tree and find the tax bill. This is my code:
soup = BeautifulSoup(html_content,'html.parser') #Soupify the HTML content
tagTandA = soup.body.find(text = "Taxes & Assessments")
taxBill = tagTandA.find_next_sibling.text
This returns an error of:
AttributeError: 'NoneType' object has no attribute 'find_next_sibling'
In fact, any function such as parent, next_sibling, or find_next_sibling returns this "object has no attribute" error.
I have tried looking for other explicit text, just to test that it's not this specific text that is giving me an issue, and the no attribute error is still thrown.
When running just the following code, it returns "None":
tagTandA = soup.body.find(text = "Taxes & Assessments")
How can I find the "Taxes & Assessments" tag in order to navigate the tree to find and return the Tax Bill?
If I'm not mistaken, you're trying to use a requests & bs4 based solution to scrape a (very) JS-heavy website, with a redirect and some iframes.
I don't think it will work.
Here is one way of getting that information (you can improve that hardcoded wait if you want, I just didn't have time to fiddle) using Selenium (there are some unused imports, you can get rid of them):
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
import time as t
import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)
chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")
webdriver_service = Service("chromedriver_linux64/chromedriver") ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)
url = 'http://www.sarasotataxcollector.com/ecomm/proc.php?r=eBillingInvitation&a=0429120051'
driver.get(url)
t.sleep(15)
wait.until(EC.frame_to_be_available_and_switch_to_it((By.XPATH, '//*[@name="body"]')))
total_taxes = wait.until(EC.element_to_be_clickable((By.XPATH, "//font[contains(text(), 'Taxes & Assessments')]/ancestor::td/following-sibling::td")))
print('Tax bill: ', total_taxes.text)
Result in terminal:
Tax bill: $8,084.54
See Selenium documentation for more details.
A very kind person (u/commandlineuser) answered the question in BS code here: https://www.reddit.com/r/learnpython/comments/10hywbs/beautifulsoup_saying_tag_has_no_attributes_while/
Here's said code:
import re
import requests
from bs4 import BeautifulSoup
url = ""
r1 = requests.get(url)
soup1 = BeautifulSoup(r1.content, "html.parser")
base = r1.url[:r1.url.rfind("/") + 1]
href1 = soup1.find("frame").get("src")
r2 = requests.get(base + href1)
soup2 = BeautifulSoup(
    r2.content
    .replace(b"<!--", b"")  # basic attempt at stripping comments
    .replace(b"-->", b""),
    "html.parser"
)
href2 = soup2.find("voicemax").get_text(strip=True)
r3 = requests.get(base + href2)
soup3 = BeautifulSoup(r3.content, "html.parser")
total = (
    soup3.find(text=re.compile("Taxes & Assessments"))
    .find_next()
    .get_text(strip=True)
)
print(total)
find_next_sibling is a function/method, use find_next_sibling().
There is a similar property so I can see the confusion.
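For example, applying that fix to the snippet from the question would look like this (assuming the find call actually matches the text, which the rest of this thread shows is the other part of the problem):
# find_next_sibling is a method, so it needs parentheses
tagTandA = soup.body.find(text="Taxes & Assessments")
taxBill = tagTandA.find_next_sibling().text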

How to identify all tags used in website using Selenium

The BeautifulSoup equivalent I am trying to accomplish is:
page_soup = soup(page_html)
tags = {tag.name for tag in page_soup.find_all()}
tags
How do I do this using Selenium? I'm just trying to print out the unique tags used by a website without having to go through the entire HTML source code, so I can begin analysing it and scrape specific parts of the website. I don't care what the content of the tags are at this point, I just want to know what tags are used.
An answer I've stumbled upon (though I'm not sure if there is a more elegant way of doing things) is this...
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
import pandas as pd  # needed for pd.Series below
website = 'https://www.afr.com'
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(website)
el = driver.find_elements(by=By.CSS_SELECTOR, value='*')
tag_list = []
for e in el:
    tag_list.append(e.tag_name)
tag_list = pd.Series(tag_list).unique()
for t in tag_list:
    print(t)
Beautifulsoup is better for this specific scenario.
But if you still want to use Selenium, you can try:
elems = driver.find_elements_by_tag_name('*')
tags = []
for x in elems:
    tags.append(x.tag_name)
Which is equivalent to:
elems = driver.find_elements_by_tag_name('*')
tags = [x.tag_name for x in elems]
If you finally want to get only the unique values, you could use the set() built-in data type for example:
set(tags)
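Putting the pieces together, a compact version of the same idea is a set comprehension over the elements:
# Collect every element on the page and keep only the unique tag names
elems = driver.find_elements_by_tag_name('*')
unique_tags = {x.tag_name for x in elems}
print(unique_tags)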

Web Scraping in python without "ids" in table

Hi, this is my first time attempting to web scrape in Python using BeautifulSoup. The problem I am having is that I am trying to scrape data from a table on a website, but the tables do not have ids. Say I was able to get the id of the element above the tr in the table: is there any way to scrape the data under that element?
This is what I am trying to scrape.
I am able to grab the id="boat" in the first tr, but I am trying to access the tr underneath it. The problem is it has a class of "bottomline", and the class name "bottomline" is used in multiple tr's which all have different values. I also can't access the div with the class name "tooltip", because that name is also used in multiple divs.
So ultimately my question is: is there a way to scrape the data in the tr that is under id="boat"?
Thanks for any help in advance!
Beautiful Soup builds a tree for you. You are not required to have any identifying information about an element in order to find it, as long as you know the structure of the tree... which you do.
In your example, you already have the <strong> element with the ID you were looking for. If you look at the HTML, you see it is a child of a <td>, which is itself a child of a <tr>. BS4 allows you to move up the tree by iterating parents of an element:
name = soup.find(id = 'boat')
print(name)
for parent_row in name.parents:
    if parent_row.name == 'tr':
        break
At this point the variable parent_row will be set to the <tr> containing your <strong>.
Next, you can see that the data you are looking for is in the next <tr> after parent, which in BS4 terminology is a sibling of parent_row. You can iterate siblings similarly:
for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break
And at this point you have the row you need, and you can get the content:
content = list(sibling_row.stripped_strings)
print(content)
Putting it all together using the code in your later post:
import requests
from bs4 import BeautifulSoup
URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
name = soup.find(id = 'boat')
print(name)
for parent_row in name.parents:
    if parent_row.name == 'tr':
        break
for sibling_row in parent_row.next_siblings:
    if sibling_row.name == 'tr':
        break
content = list(sibling_row.stripped_strings)
print(content)
If you are scraping from a table, maybe pd.read_html() from the pandas library can help here. I cannot reproduce your example because you have not offered any reproducible code, but you could try the following:
import requests
import pandas as pd
# Make a request and try to get the table using pandas
r = requests.get("your_url")
df = pd.read_html(r.content)[0]
If pandas is able to capture a dataframe from the response, then you should be able to access all the data in the table as if you were using pandas over a normal dataframe. This has worked for me many times when performing this kind of task.
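As a rough sketch against the Minecraft page mentioned later in this thread (assuming that page actually exposes an HTML table that pandas can parse, which I haven't verified):
import requests
import pandas as pd

r = requests.get("https://www.minecraftcraftingguide.net")
tables = pd.read_html(r.text)   # list of DataFrames, one per <table> found
for df in tables:
    print(df.head())            # inspect each table and pick the one you need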
this is what my code looks like
from ask_sdk_core.dispatch_components import AbstractRequestHandler, AbstractExceptionHandler
from ask_sdk_core.handler_input import HandlerInput
from ask_sdk_model.ui import SimpleCard
import feedparser
import requests
from bs4 import BeautifulSoup
import webbrowser
URL = "https://www.minecraftcraftingguide.net"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())
name = soup.find(id = 'boat')
print(name)
class MinecraftHelperIntentHandler(AbstractRequestHandler):
    """Handler for minecraft helper intent"""
    def can_handle(self, handler_input):
        return ask_utils.is_intent_name("MinecraftHelperIntent")(handler_input)
    def handle(self, handler_input):
        slots = handler_input.request_envelope.request.intent.slots
        # interactionModel.languageModel.intents[].slots[].multipleValues.enabled
        item = slots['Item'].value
        itemStr = item.str();
        imgStart = 'https://www.minecraftcraftingguide.net/img/crafting/'
        imgMid = item
        imgEnd = '-crafting.png'
        imgLink = imgStart + imgMid + imgEnd
        print(imgLink)
        speak_output = f'To craft that you will need {item} here is a link {imgLink}'
        return (
            handler_input.response_builder
            .speak(speak_output)
            .set_card(SimpleCard('test', 'card_text'))  # might need to link account for it to work
            .response
        )
I had the same issue recently, but I'm using Selenium instead of BeautifulSoup. In my case, to fix the issue I had to:
first identify a table parameter to use as a reference, then
follow the table tree on the web page I was trying to scrape, and after that
put everything into an XPath expression like the code below:
from pyotp import *
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
get_the_value_from_td = driver.find_element_by_xpath('//table[@width="517"]/tbody/tr[8]/td[8]').text
This link was very helpful to me: https://www.guru99.com/selenium-webtable.html
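For reference, the same idea generalizes to walking a whole table; here is a rough sketch (the table locator below is just an example, not taken from any page in this thread):
# Iterate every row and cell of a table located by XPath (example locator, adjust to your page)
table = driver.find_element_by_xpath('//table[@width="517"]')
for row in table.find_elements_by_xpath('.//tr'):
    cells = [cell.text for cell in row.find_elements_by_xpath('.//td')]
    print(cells)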

Using Xpath in Selenium to refer to a specific tag

Hello, I'm getting the error AttributeError: 'str' object has no attribute 'find_element', even though my code seems to be correct. I looked a little bit on Stack Overflow but I didn't find a solution to my specific problem.
I want to get the value of the fifth <p> set after the <br> tag in my HTML code; please correct me if you see any mistake.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import pandas as pd
import time
options = Options()
# Creating our dictionary
all_services = pd.DataFrame(columns=['Motif', 'Description'])
path = "C:/Users/Al4D1N/Documents/ChromeDriver_webscraping/chromedriver.exe"
driver = webdriver.Chrome(options=options, executable_path=path)
driver.get("https://www.mairie.net/national/acte-naissance.htm#plus")
# We store our 'Motif' & 'Description' for the first link 'acte-naissance'
service = driver.find_element(By.CLASS_NAME, "section-title").text
desc = service.find_element(By.XPATH, "//*[@class='section-group']/p[5]/following::br")
print(desc)
all_services = all_services.append({'Motif': service, 'Description': desc}, ignore_index=True)
# Get all elements in class 'list-images'
list_of_services = driver.find_elements_by_class_name("list-images")
all_services.to_excel('Services.xlsx', index=False)
I was able to get the text you want using this combination:
service = driver.find_element_by_class_name("section-title").text
txt = driver.find_element_by_xpath("//*[@class='section-group']/p[5]").text
desc = txt.split("\n")[1]
I also did it with requests and bs4, which is usually a little bit quicker than a headless browser (same issues of scraper navigability):
import requests
from bs4 import BeautifulSoup as Bs
r = requests.get("https://www.mairie.net/national/acte-naissance.htm#plus")
html = Bs(r.text, "lxml")
section = html.find("h1", {"class": "section-title"}).get_text()
div = html.find("div", {"class": "section-group"})
p = div.find_all("p")[4]
p.strong.extract()
txt = p.get_text().split("\t")
desc = (txt[1] + txt[2]).replace("\n", " ")[:-1]
Try to remove .text from
service = driver.find_element(By.CLASS_NAME, "section-title").text
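In other words, keep the WebElement around and only take .text at the end. A sketch of how the two lookups might be chained, reusing the p[5] XPath from the question and the other answer (unverified against the live page):
# Keep the WebElement (no .text) so further find_element calls work on it
service_el = driver.find_element(By.CLASS_NAME, "section-title")
desc_el = driver.find_element(By.XPATH, "//*[@class='section-group']/p[5]")
print(service_el.text, desc_el.text)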

BS fails to get section id in Selenium retrieved page

I have a problem with the following code
import re
from lxml import html
from bs4 import BeautifulSoup as BS
from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
import requests
import sys
import datetime
print ('start!')
print(datetime.datetime.now())
list_file = 'list2.csv'
#This should be the regular input list
url_list=["http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3"]
#This is an example input instead
binary = FirefoxBinary('C:/Program Files (x86)/Mozilla Firefox/firefox.exe')
#Read somewhere it could be a useful variable to supply, but anyway the program fails randomly at times with [WinError 6] Invalid Descriptor, with nothing different from the runs where it at least manages to get the webpage, even when it cannot perform further operations.
for page in url_list:
    print(page)
    browser = webdriver.Firefox(firefox_binary=binary)
    #I tried this too to solve the [WinError 6] but it is not working
    browser.get(page)
    print("TEST BEGINS")
    soup = BS(browser.page_source, "lxml")
    soup = soup.find("summaries")
    # This fails here. It finds nothing, while there is a section id termed summaries. soup.find_all("p") works but I don't want all the p's outside of summaries
    print(soup)  # It prints "None" indeed.
    print("TEST ENDS")
I am positive the source code includes "summaries". First there is
<li> Summaries</li>
then there is
<section id="summaries" data-ga-label="Summaries" data-section="Summaries">
As suggested here (Webscraping in python: BS, selenium, and None error) by @alexce, I tried
summary = soup.find('section', attrs={'id':'summaries'})
(Edit: the suggestion was _summaries, but I tested summaries too)
but it does not work either.
So my questions are:
why does BS not find the summaries, and why does Selenium keep breaking when I run the script too many times in a row (restarting the console works, on the other hand, but this is tedious), or with a list comprising more than four instances?
Thanks
This:
summary = soup.find('section', attrs={'id':'_summaries'})
searches for a section element that has the attribute id set to _summaries:
<section id="_summaries" />
There is no element with this attribute in the page.
The one that you want is probably <section id="summaries" data-ga-label="Summaries" data-section="Summaries">. It can be matched with:
results = soup.find('section', id='summaries')
Also, a side note on why you had to use Selenium: the page will return an error if you do not forward cookies. So in order to use requests, you need to send cookies.
My full code:
from __future__ import unicode_literals

import re
import requests
from bs4 import BeautifulSoup as BS


data = requests.get(
    'http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3',
    cookies={
        'nlbi_146342': '+fhjaf6NSntlOWmvFHlFeAAAAAAwHqv5tJUsy3kqgNQOt77C',
        'visid_incap_146342': 'tEumui9aQoue4yMuu9tuUcly6VYAAAAAQUIPAAAAAABcQsCGxBC1gj0OdNFoMEx+',
        'incap_ses_189_146342': 'bNY8PNPZJzroIFLs6nefAspy6VYAAAAAYlWrxz2UrYFlrqgcQY9AuQ=='
    }).content

soup = BS(data)
results = soup.find_all(string=re.compile('summary', re.I))
print(results)
summary_re = re.compile('summary', re.I)
results = soup.find('section', id='summaries')
print(results)
The element is probably not yet on the page. I would wait for the element before parsing the page source with BS:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
soup = BS(driver.page_source,"lxml")
I noticed that you never call driver.quit(); this may be the reason for your breaking issues.
So make sure to call it, or try to reuse the same session.
And to make it more stable and performant, I would try to work with the Selenium API as much as possible, since pulling and parsing the page source is expensive.
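For instance, combining the explicit wait above with proper cleanup might look like this (just a sketch; the quit-in-finally is the point, not the specific locator):
driver = webdriver.Firefox()
try:
    driver.get("http://www.genecards.org/cgi-bin/carddisp.pl?gene=ENO3&keywords=ENO3")
    wait = WebDriverWait(driver, 10)
    # Read the section text straight from Selenium instead of re-parsing page_source
    summaries = wait.until(EC.visibility_of_element_located((By.ID, "summaries")))
    print(summaries.text)
finally:
    driver.quit()  # always release the browser, even if something above fails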
