Getting href in a html line - python

I am using BeautifulSoup to get information from an html datasheet. Particularly, I am trying to get the href = ... in the following line:
<a class="block" href="/post/BpkL7ColOVj" style="background-image: url(https://scontent-ort2-2.cdninstagram.com/vp/09e1b7436c9125092433c041c35c1eaa/5BDB064D/t51.2885-15/e15/s480x480/43913877_2130106893692252_5245480330715053223_n.jpg)">
soup.find_all('a', attrs={'class':'block'})
Is there any other way using BeautifulSoup to get what is contained in the href?
Thanks!

Just use ['attribute_name'] this will get attributes by their name.
soup.find_all('a', attrs={'class':'block'})[0]['href']
>>> '/post/BpkL7ColOVj'
You can also use css selector which I think is more straightforward:
soup.select('a.block')[0]['href'] # same thing.

Related

Optimal way to find element containing`data-superid="picture-link"?

The element I'm looking to find looks like this:
<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">
Previously I found all href's in the page, then found the correct href by finding which one had the text pic:, but I can't do this any longer due to some pages having scrolling galleries causing stale elements.
You can filter by attribute:
driver.find_element_by_xpath('//a[#data-superid="picture-link"]')
Regarding the scrolling part, here is a previously asked question that can help you.
You could try beautifulsoup + selenium, like:
from bs4 import BeautifulSoup
text = '''<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">'''
# Under your circumstance, you need to use:
# text = driver.page_source
soup = BeautifulSoup(text, "html.parser")
print(soup.find("a", attrs={"data-superid":"picture-link"}))
Result:
<a data-superid="picture-link" href="pic:/82eu92e/iwjd/"></a>
To Extract the href value using data-superid="picture-link" use following css selector or xpath.
links=driver.find_elements_by_css_selector("a[data-superid='picture-link'][href]")
for link in links:
print(link.get_attribute("href"))
OR
links=driver.find_elements_by_xpath("//a[#data-superid='picture-link'][#href]")
for link in links:
print(link.get_attribute("href"))

Beautiful soup find_all doesn't find CSS selector with multiple classes

On the website there is this <a> element
<a role="listitem" aria-level="1" href="https://www.rest.co.il" target="_blank" class="icon rest" title="this is main title" iconwidth="35px" aria-label="website connection" style="width: 30px; overflow: hidden;"></a>
So I use this code to catch the element
(note the find_all argument a.icon.rest)
import requests
from bs4 import BeautifulSoup
url = 'http://www.zap.co.il/models.aspx?sog=e-cellphone&pageinfo=1'
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text, "html.parser")
for link in soup.find_all("a.icon.rest"):
x = link.get('href')
print(x)
Which unfortunately returns nothing
although the beautiful soup documentation clearly says:
If you want to search for tags that match two or more CSS classes, you
should use a CSS selector:
css_soup.select("p.strikeout.body")
returns: <p class="body strikeout"></p>
So why isn't this working?
By the way, I'm using pycharm
As the docs you quoted explain, if you want to search for tags that match two CSS classes, you have to use a CSS selector instead of a find_all. The example you quoted shows how to do that:
css_soup.select("p.strikeout.body")
But you didn't do that; you used find_all anyway, and of course it didn't work, because find_all doesn't take a CSS selector.
Change it to use select, which does take a CSS selector, and it will work.

soup.find_all works but soup.select doesn't work

I'm playing around with parsing an html page using css selectors
import requests
import webbrowser
from bs4 import BeautifulSoup
page = requests.get('http://www.marketwatch.com', headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(page.content, 'html.parser')
I'm having trouble selecting a list tag with a class when using the select method. However, I have no trouble when using the find_all method
soup.find_all('ul', class_= "latestNews j-scrollElement")
This returns the output I desire, but for some reason I can't do the same using
css selectors. I want to know what I'm doing wrong.
Here is my attempt:
soup.select("ul .latestNews j-scrollElement")
which is returning an empty list.
I can't figure out what I'm doing wrong with the select method.
Thank you.
From the documentation:
If you want to search for tags that match two or more CSS classes, you
should use a CSS selector:
css_soup.select("p.strikeout.body")
In your case, you'd call it like this:
In [1588]: soup.select("ul.latestNews.j-scrollElement")
Out[1588]:
[<ul class="latestNews j-scrollElement" data-track-code="MW_Header_Latest News|MW_Header_Latest News_Facebook|MW_Header_Latest News_Twitter" data-track-query=".latestNews__headline a|a.icon--facebook|a.icon--twitter">
.
.
.

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with python and the beautifulsoup library, but I'm stuck. The link is on the following page, on the sidebar area, directly underneath the h4 subtitle "Original Source:
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full html of the last element in which I've isolated, where I'm trying to simply get the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link which is the href html attribute. source_url is a bs4.element.Tag which has the get method like:
source_url.get('href')
You almost got it!!
SOLUTION 1:
You just have to run the .text method on the soup you've assigned to source_url.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get only the specific href tag related to your soup.findall element.
print source_url.get('href')
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

Beautiful Soup - Cannot find the tags

The page is: http://item.taobao.com/item.htm?id=13015989524
you can see its source code.
In its source code the following code exists
<a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank">
But when I use BeautifulSoup to read the source code and execute the following
soup.findAll('a', href="http://item.taobao.com/item.htm?id=13015989524")
It returns [] empty. What does it return '[]'?
As far as I can see, the <a> tag you are trying to find is inside a <textarea> tag. BS does not parse the contents of <textarea> as HTML, and rightly so since <textarea> should not contain HTML. In short, that page is doing something sketchy.
If you really need to get that, you might "cheat" and parse the contents of <textarea> again and search within them:
import urllib
from BeautifulSoup import BeautifulSoup as BS
soup = BS(urllib.urlopen("http://item.taobao.com/item.htm?id=13015989524"))
a = []
for textarea in soup.findAll("textarea"):
textsoup = BS(textarea.text) # parse the contents as html
a.extend(textsoup.findAll("a", attrs={"href":"http://item.taobao.com/item.htm?id=13015989524"}))
for tag in a:
print tag
# outputs
# <a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank"><img ...
# <a href="http://item.taobao.com/item.htm?id=13015989524" title="901 ...
Use a dictionary to store the attribute:
soup.findAll('a', {
'href': "http://item.taobao.com/item.htm?id=13015989524"
})

Categories