I recently ran into an unusual element that is not trivial to scrape. Could you please suggest how to retrieve its href?
I am scraping some Tripadvisor restaurant pages with Python and Scrapy and need to retrieve the Google Maps link (the href attribute) from the location and contacts section.
The webpage for example (link)
The code of the element:
<a data-encoded-url="S0k3X2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfeVBw" class="_2wKz--mA _27M8V6YV" target="_blank" href="https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
I have tried the following XPath, but every time I either got None as the response or could not get the href attribute, as if it did not exist.
response.xpath("//a[contains(@class, '_2wKz--mA _27M8V6YV')]").getall()
The output:
['<a data-encoded-url="Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z" class="_2wKz--mA _27M8V6YV" target="_blank"><span class="_2saB_OSe">Scabellstr. 10-11, 14109 Berlin Germany</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>',
'Website']
Use the data-encoded-url that you already got and decode it using Base64. Example:
>>> import base64
>>> base64.b64decode("Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z").decode("utf-8")
'gzK_https://maps.google.com/maps?saddr=&daddr=Scabellstr.+10-11%2C+14109+Berlin+Germany@52.428818,13.182421_2Ms'
You can then remove the gzK_ prefix and the _2Ms suffix, and you will have your URL.
You can also try a more specific XPath query such as "//a[contains(@class, 'foobar')]/@href" to retrieve a single attribute of the element.
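The decoding step above can be wrapped in a small helper. This is only a sketch: the function name decode_data_encoded_url is made up for illustration, and it assumes the decoded text always has the shape token_URL_token, which is what the two samples in this thread look like. If Tripadvisor changes that wrapping, the stripping logic would need to change too.

```python
import base64


def decode_data_encoded_url(encoded: str) -> str:
    """Decode a data-encoded-url value and drop the wrapper tokens.

    Assumption: the decoded text looks like '<token>_<url>_<token>',
    as in the samples in this thread. The helper name is hypothetical.
    """
    # Re-add any '=' padding that may be missing from the attribute value.
    padded = encoded + "=" * (-len(encoded) % 4)
    decoded = base64.b64decode(padded).decode("utf-8")
    # Drop everything up to the first '_' and from the last '_' onwards.
    # Raises IndexError if the decoded text contains no '_' at all.
    without_prefix = decoded.split("_", 1)[1]
    return without_prefix.rsplit("_", 1)[0]


# The value returned by the XPath query in the question.
encoded = "Z3pLX2h0dHBzOi8vbWFwcy5nb29nbGUuY29tL21hcHM/c2FkZHI9JmRhZGRyPVNjYWJlbGxzdHIuKzEwLTExJTJDKzE0MTA5K0JlcmxpbitHZXJtYW55QDUyLjQyODgxOCwxMy4xODI0MjFfMk1z"
print(decode_data_encoded_url(encoded))
```

In a Scrapy callback you would pass in the attribute value itself, for example the result of an XPath ending in /@data-encoded-url, rather than the whole element string.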
Related
I am trying to get the author of every video on the YouTube homepage by web-scraping with BeautifulSoup4.
This is the chunk of HTML I am trying to navigate to.
<a class="yt-simple-endpoint style-scope yt-formatted-string" spellcheck="false" href="/c/ApertureScience" dir="auto">Aperture</a>
With the link: https://www.youtube.com/
And I am trying to get the item "Aperture".
The problem is that I can't seem to navigate correctly to the data, I have been trying this:
import urllib.request
import bs4 as bs

source = urllib.request.urlopen('https://www.youtube.com/').read()
soup = bs.BeautifulSoup(source, 'lxml')
for i in soup.find_all('a', class_='yt-simple-endpoint style-scope yt-formatted-string'):
    print(i)
Nothing prints. I think it is because of the spaces in the class name, but I don't know how to get around that.
Any ideas would help, thank you!
Try this syntax:
find_all('a', {'class': 'yt-simple-endpoint style-scope yt-formatted-string'})
and for the 'Aperture' text use .string, .contents or .text.
If the content is dynamic (rendered by JavaScript), you could use Selenium.
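The spaces in the class attribute separate three different classes, so another option is a CSS selector that chains them. Here is a minimal sketch that runs against the static snippet from the question rather than the live page. It assumes the HTML is already available as a string; on the real YouTube homepage the markup is rendered by JavaScript, so you would first need the rendered source (for example from Selenium).

```python
from bs4 import BeautifulSoup

# Static copy of the element from the question, used here instead of the live page.
html = (
    '<a class="yt-simple-endpoint style-scope yt-formatted-string" '
    'spellcheck="false" href="/c/ApertureScience" dir="auto">Aperture</a>'
)
soup = BeautifulSoup(html, "html.parser")

# Chain the classes with dots; the order of classes in the HTML does not matter.
links = soup.select("a.yt-simple-endpoint.style-scope.yt-formatted-string")
authors = [link.get_text(strip=True) for link in links]
print(authors)
```

If this list comes back empty on the downloaded page, the elements are most likely not in the raw HTML at all, which points to dynamic content rather than a selector problem.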
I am trying to read the 'href' attribute from a website. The problem is that a 'div' contains several 'a' elements. The 'href' attribute can easily be read from the second 'a', but not from the first 'a'.
This is the following website:
https://www.google.ca/search?q=Jahresringe+Holz&hl=en&authuser=0&tbm=isch&source=hp&biw=&bih=&ei=mPc1YevoA4Svggfjk634CA
and from this website I look at the first picture.
Here is the HTML code of the website, unfortunately as an image, because I could not paste the code: HTML Code
My Python Code:
for i in range(1, 200):
    xPathOfAllA = '//*[@id="islrg"]/div[1]/div[' + str(i) + ']/a'
    el = driver.find_elements_by_xpath(xPathOfAllA)
    href = el[0].get_attribute('href')   # Returning: None
    href2 = el[1].get_attribute('href')  # Returning: https://www.vv[...]
    [...]
The right result should be: /imgres?imgurl[...]
Thanks for any help. I have also read the other Stack Overflow entries, but my problem seems to be quite different.
In the first iteration of your loop you have
el = driver.find_elements_by_xpath(xPathOfAllA)
and el[0] represents
<a class="wXeWr islib nfEiy" jsname="sTFXNd" jsaction="J9iaEb;mousedown:npT2md; touchstart:npT2md;" data-nav="1" tabindex="0" style="height: 180px;">
which does not have an href, so it is expected that it returns None, as you also noted in your comment #Returning: None
Now look at the second element:
<a class="VFACy kGQAp sMi44c lNHeqe WGvvNb" data-ved="2ahUKEwi42aPOperyAhU-BLcAHf1VD_UQr4kDegUIARC7AQ" jsname="uy6ald" rel="noopener" target="_blank" href="https://de.wikipedia.org/wiki/Jahresring" jsaction="focus:kvVbVb;mousedown:kvVbVb;touchstart:kvVbVb;" title="Jahresring – Wikipedia">Jahresring – Wikipedia<div class="fxgdke">de.wikipedia.org</div></a>
This one has href="https://de.wikipedia.org/wiki/Jahresring", which is the reason you are getting #Returning: https://www.vv[...]
Solution :
You can filter your XPath expression. If I have understood your question correctly, you are looking for all the a tags that have an href?
If so, use the XPath below:
//a[@href]
with find_elements to get a list of web elements, then simply read the href attribute of each one and either store it in a list or print it to the console.
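To see the effect of filtering on the href attribute without starting a browser, here is a small sketch that uses the standard library's ElementTree as a stand-in for Selenium. The markup is a simplified, hypothetical version of the Google Images structure from the question, and ElementTree only supports a subset of XPath, so treat it as an illustration of the predicate rather than of the Selenium API.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up markup: the first <a> has no href, the second one does.
snippet = """
<div id="islrg">
  <div>
    <a class="wXeWr">thumbnail</a>
    <a class="VFACy" href="https://de.wikipedia.org/wiki/Jahresring">Jahresring</a>
  </div>
</div>
""".strip()

root = ET.fromstring(snippet)

# Without a filter, every <a> is returned, including the one without href.
all_links = root.findall(".//a")
first_href = all_links[0].get("href")  # no href attribute on this element

# With the [@href] predicate, only <a> elements that carry an href remain.
links_with_href = root.findall(".//a[@href]")
hrefs = [a.get("href") for a in links_with_href]
print(first_href, hrefs)
```

With Selenium the same predicate goes into the XPath string passed to the find call, and get_attribute('href') is then called on each returned element.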
Update 1 :
//div[1]/div[1]/a[1]/div[1]/img/../..
The XPath above should give you the href of the first image if you call
a = driver.find_element_by_xpath("//div[1]/div[1]/a[1]/div[1]/img/../..").get_attribute('href')
print(a)
The element I'm looking to find looks like this:
<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">
Previously I found all href's in the page, then found the correct href by finding which one had the text pic:, but I can't do this any longer due to some pages having scrolling galleries causing stale elements.
You can filter by attribute:
driver.find_element_by_xpath('//a[@data-superid="picture-link"]')
Regarding the scrolling part, here is a previously asked question that can help you.
You could try beautifulsoup + selenium, like:
from bs4 import BeautifulSoup
text = '''<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">'''
# Under your circumstance, you need to use:
# text = driver.page_source
soup = BeautifulSoup(text, "html.parser")
print(soup.find("a", attrs={"data-superid":"picture-link"}))
Result:
<a data-superid="picture-link" href="pic:/82eu92e/iwjd/"></a>
To extract the href value using data-superid="picture-link", use the following CSS selector or XPath.
links=driver.find_elements_by_css_selector("a[data-superid='picture-link'][href]")
for link in links:
    print(link.get_attribute("href"))
OR
links=driver.find_elements_by_xpath("//a[@data-superid='picture-link'][@href]")
for link in links:
    print(link.get_attribute("href"))
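Since the question mentions stale elements, one possible approach is to copy the href strings out of a snapshot of the page in one go instead of holding on to web elements. The sketch below does that with BeautifulSoup and the same attribute selector. The gallery markup is made up for illustration, and in a real run page_source would come from driver.page_source.

```python
from bs4 import BeautifulSoup

# Hypothetical snapshot; with Selenium this would be driver.page_source.
page_source = """
<div class="gallery">
  <a href="pic:/82eu92e/iwjd/" data-superid="picture-link"></a>
  <a href="/other/" data-superid="other-link"></a>
  <a data-superid="picture-link"></a>
</div>
"""
soup = BeautifulSoup(page_source, "html.parser")

# Keep only <a> tags that have the data attribute AND an href, as plain strings.
hrefs = [a["href"] for a in soup.select('a[data-superid="picture-link"][href]')]
print(hrefs)
```

Plain strings cannot go stale when the gallery scrolls, although the snapshot itself only reflects the page at the moment it was taken.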
I have an element like
<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf"><div title="Next" class="ea747e34 ">
I need to pull out only "Next" from title="Next". For that I used
soup.find('a',attrs={"title": "Next"}).get('title')
Is there any method to get the title value without using .get("title")?
My code
next_page_text = soup.find('a',attrs={"title": "Next"}).get('title')
Output:
Next
I need:
next_page_text = soup.find('a',attrs={"title": "Next"})
Output:
Next
Please let me know if there is any method to find.
You should get Next. Try this: use find() or select_one(), and use an if to check whether the element is present on the page.
from bs4 import BeautifulSoup
import requests
res=requests.get("https://www.bayut.com/for-sale/property/abu-dhabi/page-182/")
soup=BeautifulSoup(res.text,"html.parser")
if soup.find("a", attrs={"title": "Next"}):
    print(soup.find("a", attrs={"title": "Next"})['title'])
If you want to use a CSS selector:
if soup.select_one("a[title='Next']"):
    print(soup.select_one("a[title='Next']")['title'])
I'm re-writing my answer as there was confusion in your original post.
If you'd like to take the URL associated with the Next tag:
soup.find('a', title='Next')['href']
['href'] can be replaced with any other attribute in the element, so title, itemprop etc.
If you'd like to select the element with Next in the title:
soup.find('a', title='Next')
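The answers above combine two ideas: check that the element exists, then read an attribute from it. A minimal sketch of both in one place follows. The helper name attribute_or_none is hypothetical, and the HTML is the snippet from the question with closing tags added so that it parses cleanly.

```python
from bs4 import BeautifulSoup

# Snippet from the question, with closing tags added for this example.
html = (
    '<a href="/for-sale/property/abu-dhabi/page-3/" title="Next" class="b7880daf">'
    '<div title="Next" class="ea747e34 "></div></a>'
)
soup = BeautifulSoup(html, "html.parser")


def attribute_or_none(soup, tag_name, attrs, key):
    """Return attribute `key` of the first matching tag, or None if absent."""
    tag = soup.find(tag_name, attrs=attrs)
    if tag is None:
        # find() returns None when nothing matches, e.g. on the last page.
        return None
    return tag.get(key)


next_title = attribute_or_none(soup, "a", {"title": "Next"}, "title")
next_href = attribute_or_none(soup, "a", {"title": "Next"}, "href")
missing = attribute_or_none(soup, "a", {"title": "Previous"}, "href")
print(next_title, next_href, missing)
```

Indexing with ['title'] works as well, but it raises KeyError when the attribute is missing, whereas .get() returns None.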
I am using BeautifulSoup to get information from an html datasheet. Particularly, I am trying to get the href = ... in the following line:
<a class="block" href="/post/BpkL7ColOVj" style="background-image: url(https://scontent-ort2-2.cdninstagram.com/vp/09e1b7436c9125092433c041c35c1eaa/5BDB064D/t51.2885-15/e15/s480x480/43913877_2130106893692252_5245480330715053223_n.jpg)">
soup.find_all('a', attrs={'class':'block'})
Is there any other way using BeautifulSoup to get what is contained in the href?
Thanks!
Just use ['attribute_name']; this gets an attribute by its name.
soup.find_all('a', attrs={'class':'block'})[0]['href']
>>> '/post/BpkL7ColOVj'
You can also use a CSS selector, which I think is more straightforward:
soup.select('a.block')[0]['href'] # same thing.