Extract specific HREF with xpath or css

I recently ran into an unusual element that is not trivial to scrape. Could you please suggest how to retrieve its href?
I am scraping some Tripadvisor restaurants with Python Scrapy and need to retrieve the Google Maps link (the href attribute) from the location and contacts section.
An example webpage: (link)
The code of the element:
<a data-encoded-url="S0k3X2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfeVBw" class="_2wKz--mA _27M8V6YV" target="_blank" href="https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
I have tried the following XPath, but every time I either got None or could not get the href attribute, as if it did not exist.
response.xpath("//a[contains(@class, '_2wKz--mA _27M8V6YV')]").getall()
The output:
['<a data-encoded-url="Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z" class="_2wKz--mA _27M8V6YV" target="_blank"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>',
'Website']

Use the data-encoded-url that you already got and decode it using Base64. Example:
>>> import base64
>>> base64.b64decode("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z").decode("utf-8")
'gzK_https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421_2Ms'
You can then remove the gzK_ prefix and _2Ms suffix and you will have your URL.
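A minimal sketch of that decode-and-strip step. It assumes the wrapper is always three characters plus an underscore on each side (as with gzK_ and _2Ms above), which is only inferred from the two samples in this thread. The helper name decode_encoded_url is hypothetical; the sample string is the one from the question's output.

```python
import base64
import re


def decode_encoded_url(encoded: str) -> str:
    """Decode a data-encoded-url value and strip its random wrapper.

    Assumption: the decoded text looks like "<3 chars>_<url>_<3 chars>".
    If it does not match that shape, the decoded text is returned unchanged.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    match = re.fullmatch(r"[^_]{3}_(.*)_[^_]{3}", decoded, flags=re.DOTALL)
    return match.group(1) if match else decoded


# Value copied from the data-encoded-url attribute in the question's output.
encoded = (
    "Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIu"
    "KzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z"
)
url = decode_encoded_url(encoded)
print(url)
```

In a Scrapy callback the attribute value itself could be read with an XPath ending in /@data-encoded-url and passed to this helper.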

Try an XPath query such as "//a[contains(@class, 'foobar')]/@href" to retrieve a specific attribute of the element.

Related

Python Web-scraping youtube.com BeautifulSoup4 problem

I am trying to get the author of every video on the YouTube homepage by web-scraping with BeautifulSoup4.
This is the chunk of HTML I am trying to navigate to.
<a class="yt-simple-endpoint style-scope yt-formatted-string" spellcheck="false" href="/c/ApertureScience" dir="auto">Aperture</a>
With the link: https://www.youtube.com/
And I am trying to get the item "Aperture".
The problem is that I can't seem to navigate correctly to the data, I have been trying this:
import urllib.request
import bs4 as bs
source = urllib.request.urlopen('https://www.youtube.com/').read()
soup = bs.BeautifulSoup(source,'lxml')
for i in soup.find_all('a', class_='yt-simple-endpoint style-scope yt-formatted-string'):
    print(i)
Nothing prints. I think it is because of the spaces in the class name, but I don't know how to get around that.
If any ideas help, thank you!

Try this syntax:
find_all('a',{'class' : 'yt-simple-endpoint style-scope yt-formatted-string'})
To get 'Aperture', read the tag's .string, .contents, or .text.
If the content is dynamic, you could use Selenium.
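As an alternative sketch, a CSS selector matches each class separately, so it does not depend on the order or spacing of the class attribute. The HTML below is the static snippet from the question, not a live response; on the real page the markup may be generated by JavaScript and therefore be absent from what urllib downloads.

```python
from bs4 import BeautifulSoup

# Static snippet copied from the question (not fetched from youtube.com).
html = (
    '<a class="yt-simple-endpoint style-scope yt-formatted-string" '
    'spellcheck="false" href="/c/ApertureScience" dir="auto">Aperture</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Each ".name" in the selector must be present among the tag's classes.
selector = "a.yt-simple-endpoint.style-scope.yt-formatted-string"
authors = [a.get_text(strip=True) for a in soup.select(selector)]
print(authors)
```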

Selenium driver cannot return href attribute

I am trying to read the 'href' attribute from a website. My problem is that a 'div' contains several 'a' elements. The 'href' attribute can easily be read from the second 'a', but not from the first 'a'.
This is the following website:
https://www.google.ca/search?q=Jahresringe+Holz&hl=en&authuser=0&tbm=isch&source=hp&biw=&bih=&ei=mPc1YevoA4Svggfjk634CA
and from this website I look at the first picture.
Here is the HTML code of the website, unfortunately as an image, because I could not paste the code: HTML Code
My Python Code:
for i in range(1,200):
    xPathOfAllA = '//*[@id="islrg"]/div[1]/div['+str(i)+']/a'
    el = driver.find_elements_by_xpath(xPathOfAllA)
    href = el[0].get_attribute('href') #Returning: None
    href2 = el[1].get_attribute('href') #Returning: https://www.vv[...]
[...]
The right result should be: /imgres?imgurl[...]
Thanks for any help. I have also read the other Stack Overflow entries, but my problem seems to be quite different.

In the first iteration of your loop, after
el = driver.find_elements_by_xpath(xPathOfAllA)
el[0] represents
<a class="wXeWr islib nfEiy" jsname="sTFXNd" jsaction="J9iaEb;mousedown:npT2md; touchstart:npT2md;" data-nav="1" tabindex="0" style="height: 180px;">
which has no href attribute, so get_attribute('href') returns None, as you noted in your comment (#Returning: None).
Now look at the second element:
<a class="VFACy kGQAp sMi44c lNHeqe WGvvNb" data-ved="2ahUKEwi42aPOperyAhU-BLcAHf1VD_UQr4kDegUIARC7AQ" jsname="uy6ald" rel="noopener" target="_blank" href="https://de.wikipedia.org/wiki/Jahresring" jsaction="focus:kvVbVb;mousedown:kvVbVb;touchstart:kvVbVb;" title="Jahresring – Wikipedia">Jahresring – Wikipedia<div class="fxgdke">de.wikipedia.org</div></a>
This one has href="https://de.wikipedia.org/wiki/Jahresring", which is why you are getting #Returning: https://www.vv[...]
Solution:
You can filter your XPath expression. If I have understood your question correctly, you are looking for all a tags that have an href?
If so, use the XPath below:
//a[contains(@href,'')]
Use it with find_elements to get a list of web elements, then read the href attribute of each one and either store it in a list or print it to the console.
Update 1:
//div[1]/div[1]/a[1]/div[1]/img/../..
The XPath above should give you the href of the first image if you call
a = driver.find_element_by_xpath("//div[1]/div[1]/a[1]/div[1]/img/../..").get_attribute('href')
print(a)

Optimal way to find element containing data-superid="picture-link"?

The element I'm looking to find looks like this:
<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">
Previously I found all hrefs on the page, then picked the correct one by checking which contained the text pic:, but I can no longer do this because some pages have scrolling galleries that cause stale elements.

You can filter by attribute:
driver.find_element_by_xpath('//a[@data-superid="picture-link"]')
Regarding the scrolling part, here is a previously asked question that can help you.

You could try beautifulsoup + selenium, like:
from bs4 import BeautifulSoup
text = '''<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">'''
# Under your circumstance, you need to use:
# text = driver.page_source
soup = BeautifulSoup(text, "html.parser")
print(soup.find("a", attrs={"data-superid":"picture-link"}))
Result:
<a data-superid="picture-link" href="pic:/82eu92e/iwjd/"></a>

To extract the href value using data-superid="picture-link", use either the CSS selector or the XPath below.
links=driver.find_elements_by_css_selector("a[data-superid='picture-link'][href]")
for link in links:
    print(link.get_attribute("href"))
OR
links=driver.find_elements_by_xpath("//a[@data-superid='picture-link'][@href]")
for link in links:
    print(link.get_attribute("href"))

Pull Title attribute without .get("title")

I have a value like this:
<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf"><div title="Next" class="ea747e34 ">
I need to pull out only "Next" from title="Next". For that I used
soup.find('a',attrs={"title": "Next"}).get('title')
Is there any method to get the title value without using .get("title")?
My code
next_page_text = soup.find('a',attrs={"title": "Next"}).get('title')
Output:
Next
I need:
next_page_text = soup.find('a',attrs={"title": "Next"})
Output:
Next
Please let me know if there is any method to find.

You should get Next. Try this: use find() or select_one(), and use an if statement to check that the element is present on the page.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.bayut.com/for-sale/property/abu-dhabi/page-182/")
soup=BeautifulSoup(res.text,"html.parser")
if soup.find("a", attrs={"title": "Next"}):
    print(soup.find("a", attrs={"title": "Next"})['title'])
If you want to use a CSS selector:
if soup.select_one("a[title='Next']"):
    print(soup.select_one("a[title='Next']")['title'])

I'm re-writing my answer as there was confusion in your original post.
If you'd like to take the URL associated with the Next tag:
soup.find('a', title='Next')['href']
['href'] can be replaced with any other attribute in the element, so title, itemprop etc.
If you'd like to select the element with Next in the title:
soup.find('a', title='Next')
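The lookups above can be sketched together on a static copy of the question's markup (with closing tags added here so the snippet is well-formed). The sketch finds the tag once and guards against find() returning None, which is what happens on a page without a Next link; the title "Previous" is only an illustrative value that is absent from the snippet.

```python
from bs4 import BeautifulSoup

# Markup from the question, closed for illustration.
html = (
    '<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf">'
    '<div title="Next" class="ea747e34 "></div></a>'
)
soup = BeautifulSoup(html, "html.parser")

# Look the tag up once, then read any attribute from it.
next_link = soup.find("a", title="Next")
next_title = next_link["title"] if next_link else None
next_href = next_link.get("href") if next_link else None

# find() returns None when nothing matches, so indexing it directly would fail.
missing = soup.find("a", title="Previous")

print(next_title, next_href, missing)
```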

Getting href in a html line

I am using BeautifulSoup to get information from an html datasheet. Particularly, I am trying to get the href = ... in the following line:
<a class="block" href="/post/BpkL7ColOVj" style="background-image: url(https://scontent-ort2-2.cdninstagram.com/vp/09e1b7436c9125092433c041c35c1eaa/5BDB064D/t51.2885-15/e15/s480x480/43913877_2130106893692252_5245480330715053223_n.jpg)">
soup.find_all('a', attrs={'class':'block'})
Is there any other way using BeautifulSoup to get what is contained in the href?
Thanks!

Just use ['attribute_name']; this gets an attribute by its name.
soup.find_all('a', attrs={'class':'block'})[0]['href']
>>> '/post/BpkL7ColOVj'
You can also use css selector which I think is more straightforward:
soup.select('a.block')[0]['href'] # same thing.
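When a page has several such links, a short sketch along the same lines can collect every href at once. The second anchor below is a made-up example without an href, added only to show that the [href] part of the selector skips it.

```python
from bs4 import BeautifulSoup

# First anchor is based on the question; the second is hypothetical.
html = (
    '<a class="block" href="/post/BpkL7ColOVj"></a>'
    '<a class="block">no link here</a>'
)
soup = BeautifulSoup(html, "html.parser")

# "a.block[href]" keeps only a.block tags that actually carry an href.
hrefs = [a["href"] for a in soup.select("a.block[href]")]
print(hrefs)
```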
