Get href using css selector with Scrapy - python

I want to get the href value:
<span class="title">
</span>
I tried this:
Link = Link1.css('span[class=title] a::text').extract()[0]
But I just get the text inside the <a>. How can I get the link inside the href?

What you're looking for is:
Link = Link1.css('span[class=title] a::attr(href)').extract()[0]
Since you're matching on the span's class attribute, you can also write:
Link = Link1.css('span.title a::attr(href)').extract()[0]
Please note that the ::text pseudo-element and the ::attr(attributename) functional pseudo-element are NOT standard CSS3 selectors. They're Scrapy-specific extensions to CSS selectors, available since Scrapy 0.20.
Edit (2017-07-20): starting from Scrapy 1.0, you can use .extract_first() instead of .extract()[0]
Link = Link1.css('span[class=title] a::attr(href)').extract_first()
Link = Link1.css('span.title a::attr(href)').extract_first()

Link = Link1.css('span.title a::attr(href)').extract_first()
You can get more information from this.

This will do the job:
Link = Link1.css('span.title a::attr(href)').extract()
Note that .extract() returns a list, so Link will be ['https://www.example.com']. Use .extract_first() if you want the string itself.

Related

Extract data-content from span tag in BeautifulSoup

I have such HTML code:
<li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
<span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
other text</span></p></li>
And I'd like to obtain the SOME TEXT from the data-content.
I wrote
target = soup.find('span', {'class' : 'tooltip-iws'})['data-content']
to get the span, and I wrote
identifier_elt= soup.find("li", {'class': 'IDENTIFIER'})
to get the class, but I'm not sure how to combine the two.
But the class tooltip-iws is not unique, and I would get extraneous results if I just used that (there are other spans, before the code snippet, with the same class)
That's why I want to specify my search within the class IDENTIFIER. How can I do that in BeautifulSoup?
Try using a CSS selector:
soup.select_one("li[class='IDENTIFIER'] > p > span")['data-content']
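A small sketch of scoping the search to the IDENTIFIER item, assuming a hypothetical page where another span with the same class appears earlier. It shows both the CSS-selector route and chained find() calls; the surrounding <ul> and the OTHER item are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first <li> stands in for the "other spans"
# that share the tooltip-iws class.
html = """
<ul>
  <li class="OTHER"><p><span class="tooltip-iws" data-content="UNWANTED">x</span></p></li>
  <li class="IDENTIFIER"><h5 class="hidden">IDENTIFIER</h5><p>
    <span class="tooltip-iws" data-toggle="popover" data-content="SOME TEXT">
    other text</span></p></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Option 1: a CSS selector restricted to the <li class="IDENTIFIER">.
by_css = soup.select_one("li.IDENTIFIER span.tooltip-iws")["data-content"]

# Option 2: find the <li> first, then search only inside it.
li = soup.find("li", class_="IDENTIFIER")
by_find = li.find("span", class_="tooltip-iws")["data-content"]

print(by_css)
print(by_find)
```

Both select_one() and find() return None when nothing matches, so real code may want to check the result before indexing into it.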
You could also try selectorlib, which may solve your issue. Comment if you need further assistance.
https://selectorlib.com/

Optimal way to find element containing data-superid="picture-link"?

The element I'm looking to find looks like this:
<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">
Previously I found all hrefs on the page, then picked the correct one by checking which contained the text pic:, but I can no longer do this because some pages have scrolling galleries that cause stale elements.
You can filter by attribute:
driver.find_element_by_xpath('//a[@data-superid="picture-link"]')
Regarding the scrolling part, here is a previously asked question that can help you.
You could try beautifulsoup + selenium, like:
from bs4 import BeautifulSoup
text = '''<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">'''
# Under your circumstance, you need to use:
# text = driver.page_source
soup = BeautifulSoup(text, "html.parser")
print(soup.find("a", attrs={"data-superid":"picture-link"}))
Result:
<a data-superid="picture-link" href="pic:/82eu92e/iwjd/"></a>
To extract the href value using data-superid="picture-link", use either the following CSS selector or the XPath.
links = driver.find_elements_by_css_selector("a[data-superid='picture-link'][href]")
for link in links:
    print(link.get_attribute("href"))
OR
links = driver.find_elements_by_xpath("//a[@data-superid='picture-link'][@href]")
for link in links:
    print(link.get_attribute("href"))

Getting href in a html line

I am using BeautifulSoup to get information from an html datasheet. Particularly, I am trying to get the href = ... in the following line:
<a class="block" href="/post/BpkL7ColOVj" style="background-image: url(https://scontent-ort2-2.cdninstagram.com/vp/09e1b7436c9125092433c041c35c1eaa/5BDB064D/t51.2885-15/e15/s480x480/43913877_2130106893692252_5245480330715053223_n.jpg)">
soup.find_all('a', attrs={'class':'block'})
Is there any other way using BeautifulSoup to get what is contained in the href?
Thanks!
Just index the tag with ['attribute_name']; this returns the attribute's value by name.
soup.find_all('a', attrs={'class':'block'})[0]['href']
>>> '/post/BpkL7ColOVj'
You can also use a CSS selector, which I think is more straightforward:
soup.select('a.block')[0]['href'] # same thing.
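When more than one matching link exists, or some may lack an href, a sketch like the following may help. The second and third anchors are hypothetical additions to the question's markup; the point is the difference between tag['href'] (raises KeyError if the attribute is absent) and tag.get('href') (returns None).

```python
from bs4 import BeautifulSoup

# The first anchor is from the question (style attribute omitted);
# the other two are hypothetical.
html = """
<a class="block" href="/post/BpkL7ColOVj"></a>
<a class="block" href="/post/hypothetical2"></a>
<a class="block"></a>
"""
soup = BeautifulSoup(html, "html.parser")

# The [href] part of the selector skips anchors that have no href at all.
hrefs = [a["href"] for a in soup.select("a.block[href]")]

# .get() is the forgiving form: it returns None instead of raising KeyError.
last = soup.select("a.block")[-1]
safe = last.get("href")

print(hrefs)
print(safe)
```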

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the BeautifulSoup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source":
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full HTML of the last element I've isolated, whereas I'm trying to get just the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link which is the href html attribute. source_url is a bs4.element.Tag which has the get method like:
source_url.get('href')
You almost got it!!
SOLUTION 1:
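You can read the .text attribute of the tag you've assigned to source_url. Note that this returns the link's visible text, so it only works here because the link text happens to be the URL itself.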
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get the value of the href attribute of the tag returned by findAll.
print(source_url.get('href'))
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
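The two solutions give the same output on this page only because the link's text is the URL. A short sketch with hypothetical markup, where the visible text differs from the target, shows why .get('href') is the more reliable choice:

```python
from bs4 import BeautifulSoup

# Hypothetical sidebar markup; the URL and link text are invented.
html = (
    '<div class="widget-content">'
    '<a href="http://example.org/full-article">Read the original</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")
source_url = soup.find("div", class_="widget-content").find_all("a")[-1]

link_text = source_url.text         # the visible text of the link
link_href = source_url.get("href")  # the value of the href attribute

print(link_text)
print(link_href)
```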

Python 3, beautiful soup, get next tag

I have the following html part which repeates itself several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: Does not really have to be beautifulsoup; when it can be done with regex and the python html parser this is also ok.
EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: OK, it seems I found out how to solve it. Now I did:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by passing the class as a second argument to find_all:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True) # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like, you can be stricter about it, e.g., taking the link only if it is the first child of the <div>:
a = div.contents[0] # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True) # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
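To make the difference between find() and find_next() concrete, here is a sketch on hypothetical markup: the first div contains its link, while the second is empty and is merely followed by one. The link text, the second div, and the trailing anchor are assumptions added for illustration.

```python
from bs4 import BeautifulSoup

# First div is from the question (closing tags added);
# the empty div and the trailing <a> are hypothetical.
html = """
<div class="product-list-item margin-bottom">
  <a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">one</a>
</div>
<div class="product-list-item"></div>
<a href="http://www.urlexample.com/outside">outside</a>
"""
soup = BeautifulSoup(html, "html.parser")

inside, following = [], []
for div in soup.find_all("div", "product-list-item"):
    # find() searches only the descendants of the div.
    a = div.find("a", href=True)
    inside.append(a["href"] if a is not None else None)

    # find_next() searches everything after the div's opening tag
    # in document order, whether or not it is inside the div.
    a = div.find_next("a", href=True)
    following.append(a["href"] if a is not None else None)

print(inside)
print(following)
```

For the empty div, find() reports no link while find_next() picks up the anchor that follows it, which may or may not be what you want.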
There are many ways to search and navigate in BeautifulSoup.
If you search with lxml.html, you could also use xpath and css expressions if you are familiar with them.
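Since the question also allows the standard library's HTML parser, here is a rough sketch using html.parser with no third-party dependencies. The class name FirstLinkAfterDiv and the sample markup are hypothetical. Like find_next(), it takes the first <a href> seen after each matching <div> and does not check that the link is actually nested inside it.

```python
from html.parser import HTMLParser


class FirstLinkAfterDiv(HTMLParser):
    """Collect the href of the first <a> seen after each matching <div>."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.waiting = False  # True once a matching <div> has been opened
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and self.css_class in classes:
            self.waiting = True
        elif tag == "a" and self.waiting and attrs.get("href"):
            self.links.append(attrs["href"])
            self.waiting = False


# Hypothetical document with two repeated product blocks.
html = """
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">one</a>
</div>
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_2" data-style-id="sp_2867">two</a>
</div>
"""
parser = FirstLinkAfterDiv("product-list-item")
parser.feed(html)
parser.close()
links = parser.links
print(links)
```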
