I want to split out a particular piece of text from the outerHTML attribute of a web link.
while Id is True:
    link = driver.find_element_by_xpath("//a[@id='bu:ms:all-sp:2']")
    href = link.get_attribute("outerHTML")
    link.click()
    # This will load the link in the same page!
    self.assertIn(href, self.page.get_current_url())
When I print href, the output is:
<a id="bu:ms:all-sp:8" href="/euro/tennis" class="Pointer"><span class="SportImg8"></span> Tennis <span class="NumEvt">51</span></a>
I want to split this and assert just the href value (/euro/tennis) against the current URL.
Could anyone please help me out here?
Get the href attribute instead of outerHTML:
href = link.get_attribute("href")
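For example, a minimal sketch of the asker's loop with that change (the Id flag, driver and self.page helpers are the asker's own and assumed to be set up elsewhere); note that Selenium resolves href to an absolute URL:

while Id is True:
    link = driver.find_element_by_xpath("//a[@id='bu:ms:all-sp:2']")
    href = link.get_attribute("href")  # absolute URL, e.g. "https://host/euro/tennis"
    link.click()
    # the link loads in the same page, so the current URL should now contain it
    self.assertIn(href, self.page.get_current_url())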
Related
I am trying to read the href attribute from a website. The problem is that a div contains several a elements. The href attribute can easily be read from the second a, but not from the first one.
Here is the website:
https://www.google.ca/search?q=Jahresringe+Holz&hl=en&authuser=0&tbm=isch&source=hp&biw=&bih=&ei=mPc1YevoA4Svggfjk634CA
and from this website I look at the first picture.
Here is the HTML code of the website, unfortunately as an image, because I could not paste the code: HTML Code
My Python Code:
for i in range(1, 200):
    xPathOfAllA = '//*[@id="islrg"]/div[1]/div[' + str(i) + ']/a'
    el = driver.find_elements_by_xpath(xPathOfAllA)
    href = el[0].get_attribute('href')    # Returning: None
    href2 = el[1].get_attribute('href')   # Returning: https://www.vv[...]
    [...]
The right result should be: /imgres?imgurl[...]
Thanks for any help. I have also read the other Stack Overflow entries, but my problem seems to be quite different.
For the first iteration of the loop, where you have
el = driver.find_elements_by_xpath(xPathOfAllA)
el[0] represents
<a class="wXeWr islib nfEiy" jsname="sTFXNd" jsaction="J9iaEb;mousedown:npT2md; touchstart:npT2md;" data-nav="1" tabindex="0" style="height: 180px;">
which does not have an href attribute, so it obviously returns None, as you also mentioned in your comment: # Returning: None
Now look at the second element:
<a class="VFACy kGQAp sMi44c lNHeqe WGvvNb" data-ved="2ahUKEwi42aPOperyAhU-BLcAHf1VD_UQr4kDegUIARC7AQ" jsname="uy6ald" rel="noopener" target="_blank" href="https://de.wikipedia.org/wiki/Jahresring" jsaction="focus:kvVbVb;mousedown:kvVbVb;touchstart:kvVbVb;" title="Jahresring – Wikipedia">Jahresring – Wikipedia<div class="fxgdke">de.wikipedia.org</div></a>
This one has href="https://de.wikipedia.org/wiki/Jahresring", which is why you are getting # Returning: https://www.vv[...]
Solution:
You can filter your XPath expression. If I have understood your question correctly, you are looking for all the a tags that have an href?
If so, use the XPath below:
//a[@href]
with find_elements to get a list of web elements, then simply read the href attribute and either store it in a list or print it to the console.
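A minimal sketch of that approach, assuming the driver is already on the results page:

# every <a> on the page that actually carries an href attribute
anchors = driver.find_elements_by_xpath("//a[@href]")

hrefs = [a.get_attribute('href') for a in anchors]
for href in hrefs:
    print(href)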
Update 1:
//div[1]/div[1]/a[1]/div[1]/img/../..
The above-mentioned XPath should give you the first image's href if you call:
a = driver.find_element_by_xpath("//div[1]/div[1]/a[1]/div[1]/img/../..").get_attribute('href')
print(a)
Recently I tackled an unusual element that is not trivial to scrape. Could you please suggest how to retrieve its href?
I am scraping some Tripadvisor restaurants with Python Scrapy and need to retrieve the Google Maps link (href attribute) from the location and contacts section.
An example webpage: (link)
The code of the element:
<a data-encoded-url="S0k3X2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfeVBw" class="_2wKz--mA _27M8V6YV" target="_blank" href="**https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany#52.428818,13.182421**"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
I have tried the following XPath, but got None every time, or could not get the href attribute at all, as if it did not exist.
response.xpath("//a[contains(@class, '_2wKz--mA _27M8V6YV')]").getall()
The output:
['<a data-encoded-url="Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z" class="_2wKz--mA _27M8V6YV" target="_blank"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>',
'Website']
Use the data-encoded-url that you already got and decode it using Base64. Example:
>>> import base64
>>> base64.b64decode("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z").decode("utf-8")
'gzK_https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany#52.428818,13.182421_2Ms'
You can then remove the gzK_ prefix and _2Ms suffix and you will have your URL.
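A minimal sketch of that clean-up, assuming the random prefix and suffix are always delimited from the URL by the first and last underscore (as in the example above, where the URL itself contains no underscores):

import base64

# data-encoded-url value scraped from the <a> element
encoded = "Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z"
decoded = base64.b64decode(encoded).decode("utf-8")

# drop everything up to the first underscore and after the last underscore
url = decoded.split("_", 1)[1].rsplit("_", 1)[0]
print(url)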
You can also try a more specific XPath query, such as "//a[contains(@class, 'foobar')]/@href", to retrieve a specific attribute of the element.
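A small self-contained sketch of that /@href syntax using a Scrapy Selector over an inline snippet; it only helps when the href attribute is actually present in the downloaded HTML, which was not the case for the asker's page:

from scrapy.selector import Selector

html = '<a class="_2wKz--mA _27M8V6YV" href="https://maps.google.com/maps?saddr=&amp;daddr=...">address</a>'
sel = Selector(text=html)

# /@href selects the attribute value itself instead of the whole element
href = sel.xpath("//a[contains(@class, '_2wKz--mA')]/@href").get()
print(href)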
I'm trying to extract the href from an a tag with a class, but am unable to extract it.
I've tried url = tag_variable.find("href"), but am getting None.
<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
<strong>
Longo's Fresh Herbs Basil</strong>
</a>
href is an attribute of the a tag, not a tag object itself, which is what find expects.
Assuming you have the desired a tag as tag_variable, you can use subscription, like a dict:
url = tag_variable["href"]
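A minimal end-to-end sketch, assuming the snippet above is the HTML being parsed:

from bs4 import BeautifulSoup

html = '''
<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
    <strong>Longo's Fresh Herbs Basil</strong>
</a>
'''
soup = BeautifulSoup(html, "html.parser")

tag_variable = soup.find("a", class_="product-card__name")
url = tag_variable["href"]  # subscription reads the attribute value
print(url)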
I have this HTML
text1</span> <br /><span class="UC">text2</span>
I want to get the hyperlink and click on it. I write:
link = driver.find_element_by_link_text('text')
link.click()
But the problem is that there are two pieces of text inside the a tag. How do I modify the syntax?
Try the code below:
link = driver.find_element_by_link_text('text1\ntext2')
link.click()
There is also the possibility to find the element by "text1" or "text2" using find_element_by_partial_link_text():
link = driver.find_element_by_partial_link_text('text1')
link.click()
I have the following HTML part, which repeats itself several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
I'm pretty new to BeautifulSoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: It does not really have to be BeautifulSoup; if it can be done with regex and the Python HTML parser, that is also OK.
EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This gives me a list of all the "product-list-item" divs, but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: OK, it seems I found out how to solve it. Now I did:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by passing another argument to the find_all function:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True)  # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like, you can be more strict about it, e.g., taking the link only if it is the first child of the <div>:
a = div.contents[0]  # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True)  # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
There are many ways to search and navigate in BeautifulSoup.
If you search with lxml.html, you could also use xpath and css expressions if you are familiar with them.
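For example, here is a minimal sketch of the same extraction with lxml.html and an XPath expression (the inline HTML is adapted from the question above):

import lxml.html

htmlsource = '''
<div class="product-list-item margin-bottom">
    <a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">example</a>
</div>
'''
tree = lxml.html.fromstring(htmlsource)

# take the href of each <a> that is a direct child of a "product-list-item" div
links = tree.xpath('//div[contains(@class, "product-list-item")]/a/@href')
print(links)  # ['http://www.urlexample.com/example_1']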