Selenium driver cannot return href attribute - python

I am trying to read the 'href' attribute from a website. The problem is that a 'div' contains several 'a' elements. The 'href' attribute of the second 'a' can be read easily, but not that of the first 'a'.
This is the following website:
https://www.google.ca/search?q=Jahresringe+Holz&hl=en&authuser=0&tbm=isch&source=hp&biw=&bih=&ei=mPc1YevoA4Svggfjk634CA
and from this website I look at the first picture.
Here is the HTML code of the website, unfortunately as an image, because I could not paste the code: HTML Code
My Python Code:
for i in range(1,200):
    xPathOfAllA = '//*[@id="islrg"]/div[1]/div['+str(i)+']/a'
    el = driver.find_elements_by_xpath(xPathOfAllA)
    href = el[0].get_attribute('href')  # Returning: None
    href2 = el[1].get_attribute('href')  # Returning: https://www.vv[...]
    [...]
The right result should be: /imgres?imgurl[...]
Thanks for any help. I have also read the other Stack Overflow entries, but my problem seems to be quite different.

In the first iteration of your loop, after
el = driver.find_elements_by_xpath(xPathOfAllA)
el[0] represents
<a class="wXeWr islib nfEiy" jsname="sTFXNd" jsaction="J9iaEb;mousedown:npT2md; touchstart:npT2md;" data-nav="1" tabindex="0" style="height: 180px;">
which has no href attribute, so get_attribute('href') returns None, as you also noted in your comment (#Returning: None).
Now look at the second element:
<a class="VFACy kGQAp sMi44c lNHeqe WGvvNb" data-ved="2ahUKEwi42aPOperyAhU-BLcAHf1VD_UQr4kDegUIARC7AQ" jsname="uy6ald" rel="noopener" target="_blank" href="https://de.wikipedia.org/wiki/Jahresring" jsaction="focus:kvVbVb;mousedown:kvVbVb;touchstart:kvVbVb;" title="Jahresring – Wikipedia">Jahresring – Wikipedia<div class="fxgdke">de.wikipedia.org</div></a>
This one has an href (href="https://de.wikipedia.org/wiki/Jahresring"), which is why you are getting #Returning: https://www.vv[...]
Solution:
You can filter your XPath expression. If I have understood your question correctly, you are looking for all the a tags that have an href?
If so, use the XPath below:
//a[@href]
Use it with find_elements to get a list of web elements, then simply read the href attribute of each one and either store it in a list or print it to the console.
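The "get the attribute and store it in a list" step can be sketched as below. This is only an illustration: collect_hrefs and FakeElement are hypothetical names, and FakeElement merely stands in for a Selenium WebElement so the snippet runs without a browser.

```python
def collect_hrefs(elements):
    """Return the href of every element that has one, skipping missing values."""
    hrefs = []
    for element in elements:
        href = element.get_attribute("href")
        if href:  # None (no href attribute) and empty strings are skipped
            hrefs.append(href)
    return hrefs


class FakeElement:
    """Hypothetical stand-in for a Selenium WebElement, used only for this demo."""

    def __init__(self, href=None):
        self._href = href

    def get_attribute(self, name):
        return self._href if name == "href" else None


# First anchor has no href, second one does (as in the question).
anchors = [FakeElement(), FakeElement("https://de.wikipedia.org/wiki/Jahresring")]
print(collect_hrefs(anchors))
```

With a real driver, the list returned by find_elements would be passed in place of anchors.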
Update 1 :
//div[1]/div[1]/a[1]/div[1]/img/../..
The XPath above should give you the href of the first image if you call:
a = driver.find_element_by_xpath("//div[1]/div[1]/a[1]/div[1]/img/../..").get_attribute('href')
print(a)

Related

Selenium starts-with searches the entire page, not the given WebElement

I want to search for a class name with starts-with inside a specific WebElement, but the search covers the entire page. I do not know what is wrong.
This returns a list:
muidatagrid_rows = driver.find_elements(by=By.CLASS_NAME, value='MuiDataGrid-row')
one_row = muidatagrid_rows[0]
This is the HTML fragment inside the WebElement (one_row):
<div class="market-watcher-title_os_button_container__4-yG+">
    <div class="market-watcher-title_tags_container__F37og"></div>
    <div>
        <a href="#" target="blank" rel="noreferrer" data-testid="ios download button for 1628080370">
            <img class="apple-badge-icon-image"></a>
    </div>
    <div></div>
</div>
If I search with the full class name like this:
tags_and_marketplace_section = one_row.find_element(by=By.CLASS_NAME, value="market-watcher-title_os_button_container__4-yG+")
It gives this error:
selenium.common.exceptions.InvalidSelectorException: Message: Given css selector expression ".market-watcher-title_os_button_container__4-yG+" is invalid: InvalidSelectorError: Element.querySelector: '.market-watcher-title_os_button_container__4-yG+' is not a valid selector: ".market-watcher-title_os_button_container__4-yG+"
So I want to search with the starts-with method instead, but I cannot get the result I want.
This should return only two WebElements, but it returns 20:
tags_and_marketplace_section = one_row.find_elements(by=By.XPATH, value='//div[starts-with(@class, "market-watcher-")]')
print(len(tags_and_marketplace_section))
>>> 20
Without seeing the page you are scraping, it's difficult to help fully. However, I've found that chaining selectors can help narrow down the returned results. Also, the By.CSS_SELECTOR method works best for me.
For example, if what you want is inside a div and a p, then you would do something like this:
driver.find_elements(by=By.CSS_SELECTOR, value="div .MuiDataGrid-row p")
Then you can work with the elements that are returned, as you described. You may be able to use other methods or selectors, but this is my favourite route so far.
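A likely cause of the 20 matches in the question is that an XPath beginning with // is evaluated against the whole document, even when find_elements is called on a WebElement. Prefixing the expression with a dot makes it relative to that element. The sketch below only builds the scoped expression; relative_xpath is a hypothetical helper name, and the Selenium call is left as a comment because it needs a live browser session.

```python
def relative_xpath(xpath: str) -> str:
    """Return the XPath prefixed with '.' if it starts with '/', so that it is
    evaluated relative to the context element instead of the document root."""
    return "." + xpath if xpath.startswith("/") else xpath


# The expression from the question, scoped to the element it is called on.
scoped = relative_xpath('//div[starts-with(@class, "market-watcher-")]')
print(scoped)

# With a live Selenium session (not run here), it would be used like this:
# matches = one_row.find_elements(by=By.XPATH, value=scoped)
```

Writing the leading dot by hand works just as well; the helper only makes the intent explicit.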

BeautifulSoup doesn't return href, it returns None

>>> soup_brand
<a data-role="BRAND" href="/URL/somename">
Some Name
</a>
>>> type(soup_brand)
<class 'bs4.BeautifulSoup'>
>>> print(soup_brand.get('href'))
None
Documentation followed: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Hi people from all over the world,
does someone know what's going wrong, or am I targeting the object wrong?
I need to get the href.
Have you tried:
soup.find_all(name="a")
or
soup.select_one(selector="a")
It should also be possible to get it with:
all_anchor_tags = soup.find_all(name="a")
for tag in all_anchor_tags:
    print(tag.get("href"))  # prints the href attribute of each a tag, i.e. each link
Although find_all looks for multiple elements (which is why there is a loop here), I have found that bs4 is sometimes better at catching things if you search for all matches and then iterate over the elements.
To apply ['href'], the object must be a <bs4.element.Tag>.
So, try this:
from bs4 import BeautifulSoup
string = \
"""
<a data-role="BRAND" href="/URL/somename">
Some Name
</a>
"""
s = BeautifulSoup(string, "html.parser")  # name the parser explicitly
a_tag = s.find('a')
print(a_tag["href"])
Output:
/URL/somename
Or, if you have multiple a tags, you can try this:
a_tags = s.find_all('a')
for a in a_tags:
    print(a.get("href"))
Output:
/URL/somename

Extract specific HREF with xpath or css

Recently I ran into an unusual element that is not trivial to scrape. Could you please suggest how to retrieve the href?
I am scraping some Tripadvisor restaurants with Python Scrapy and need to retrieve the Google Maps link (href attribute) from the location and contacts section.
The webpage for example (link)
The code of the element:
<a data-encoded-url="S0k3X2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfeVBw" class="_2wKz--mA _27M8V6YV" target="_blank" href="https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
I have tried the following XPath, but every time I either got None as the response or could not get the href attribute, as if it did not exist.
response.xpath("//a[contains(@class, '_2wKz--mA _27M8V6YV')]").getall()
The output:
['<a data-encoded-url="Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z" class="_2wKz--mA _27M8V6YV" target="_blank"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>',
'Website']
Use the data-encoded-url that you already got and decode it using Base64. Example:
>>> import base64
>>> base64.b64decode("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z").decode("utf-8")
'gzK_https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421_2Ms'
You can then remove the gzK_ prefix and _2Ms suffix and you will have your URL.
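The decode-and-trim steps can be combined in one short sketch. It assumes, based only on the two samples in the question, that the prefix and suffix are always four characters long (three characters plus an underscore); decode_map_url is a hypothetical helper name.

```python
import base64


def decode_map_url(encoded: str) -> str:
    """Decode a data-encoded-url value and strip the 4-character prefix and suffix.

    Assumption: the wrapper is always 'xxx_' + url + '_xxx', as in the samples.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    return decoded[4:-4]


# Value taken from the scraped output shown in the question.
encoded = "Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z"
url = decode_map_url(encoded)
print(url)
```

If the wrapper length ever varies, splitting on the first and last underscore would be a more tolerant alternative.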
You can try a more specific XPath query such as "//a[contains(@class, 'foobar')]/@href" to retrieve a specific attribute of the element.

Pull title attribute without .get("title")

I have a value like
<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf"><div title="Next" class="ea747e34 ">
I need to pull out only "Next" from title="Next". For that I used
soup.find('a',attrs={"title": "Next"}).get('title')
Is there any method to get the title value without using .get("title")?
My code
next_page_text = soup.find('a',attrs={"title": "Next"}).get('title')
Output:
Next
I need:
next_page_text = soup.find('a',attrs={"title": "Next"})
Output:
Next
Please let me know if there is any method to find.
You should get Next. Try this: use find() or select_one(), and use an if to check whether the element is present on the page.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.bayut.com/for-sale/property/abu-dhabi/page-182/")
soup=BeautifulSoup(res.text,"html.parser")
if soup.find("a", attrs={"title": "Next"}):
    print(soup.find("a", attrs={"title": "Next"})['title'])
If you want to use a CSS selector:
if soup.select_one("a[title='Next']"):
    print(soup.select_one("a[title='Next']")['title'])
I'm re-writing my answer as there was confusion in your original post.
If you'd like to take the URL associated with the Next tag:
soup.find('a', title='Next')['href']
['href'] can be replaced with any other attribute in the element, so title, itemprop etc.
If you'd like to select the element with Next in the title:
soup.find('a', title='Next')

Python 3, beautiful soup, get next tag

I have the following HTML fragment, which repeats several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: It does not really have to be BeautifulSoup; if it can be done with regex and the Python HTML parser, that is also OK.
EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: OK, it seems I found out how to solve it. Now I did:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by passing another argument to the find_all function:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True) # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like, you can be stricter about it, e.g. taking the link only if it is the first child in <div>:
a = div.contents[0] # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True) # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
There are many ways to search and navigate in BeautifulSoup.
If you search with lxml.html, you could also use xpath and css expressions if you are familiar with them.
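Since the question also allows the Python HTML parser, here is a minimal standard-library sketch of the same idea. ProductLinkParser is a hypothetical class name, and the sample markup adds closing tags and a second, non-matching div that are not in the question. It assumes the link is the first <a> with an href that appears after the opening div tag, with no other div in between.

```python
from html.parser import HTMLParser


class ProductLinkParser(HTMLParser):
    """Collect the href of the first <a> that follows each product-list-item <div>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._expecting_link = False  # True right after a matching <div>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Any <div> resets the state; only a matching class sets it to True.
            classes = (attrs.get("class") or "").split()
            self._expecting_link = "product-list-item" in classes
        elif tag == "a" and self._expecting_link:
            href = attrs.get("href")
            if href is not None:
                self.links.append(href)
                self._expecting_link = False


html = """
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
</a></div>
<div class="other"><a href="http://www.urlexample.com/ignored"></a></div>
"""

parser = ProductLinkParser()
parser.feed(html)
print(parser.links)
```

For real pages, BeautifulSoup or lxml are usually more forgiving of broken markup than this event-based approach.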
