Optimal way to find element containing data-superid="picture-link"? - python

The element I'm looking to find looks like this:
<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">
Previously I found all hrefs on the page, then picked the correct one by checking which contained the text pic:, but I can't do this any longer because some pages have scrolling galleries that cause stale elements.

You can filter by attribute:
driver.find_element_by_xpath('//a[@data-superid="picture-link"]')
Regarding the scrolling part, here is a previously asked question that can help you.

You could try beautifulsoup + selenium, like:
from bs4 import BeautifulSoup
text = '''<a href="pic:/82eu92e/iwjd/" data-superid="picture-link">'''
# Under your circumstance, you need to use:
# text = driver.page_source
soup = BeautifulSoup(text, "html.parser")
print(soup.find("a", attrs={"data-superid":"picture-link"}))
Result:
<a data-superid="picture-link" href="pic:/82eu92e/iwjd/"></a>

To extract the href value using data-superid="picture-link", use either the following CSS selector or the XPath.
links = driver.find_elements_by_css_selector("a[data-superid='picture-link'][href]")
for link in links:
    print(link.get_attribute("href"))
OR
links = driver.find_elements_by_xpath("//a[@data-superid='picture-link'][@href]")
for link in links:
    print(link.get_attribute("href"))
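One way to sidestep stale elements is to avoid holding on to element handles at all: take the page source as a string and pull the href values out in a single pass. The sketch below does this with the standard library's html.parser instead of Selenium or BeautifulSoup. That substitution is an assumption made so the example runs without a browser, and the names PictureLinkParser and picture_links, as well as the second anchor in the sample markup, are hypothetical.

```python
from html.parser import HTMLParser


class PictureLinkParser(HTMLParser):
    """Collects href values of <a> tags marked data-superid="picture-link"."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attr_map = dict(attrs)
        if (
            tag == "a"
            and attr_map.get("data-superid") == "picture-link"
            and "href" in attr_map
        ):
            self.hrefs.append(attr_map["href"])


def picture_links(page_source):
    """Return every matching href found in an HTML string."""
    parser = PictureLinkParser()
    parser.feed(page_source)
    parser.close()
    return parser.hrefs


# With Selenium, the string would come from driver.page_source.
html = (
    '<a href="pic:/82eu92e/iwjd/" data-superid="picture-link"></a>'
    '<a href="/other">not a picture link</a>'
)
print(picture_links(html))
```

Because the result is a list of plain strings, nothing in it can go stale when the gallery scrolls.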

Related

Getting href in a html line

I am using BeautifulSoup to get information from an html datasheet. Particularly, I am trying to get the href = ... in the following line:
<a class="block" href="/post/BpkL7ColOVj" style="background-image: url(https://scontent-ort2-2.cdninstagram.com/vp/09e1b7436c9125092433c041c35c1eaa/5BDB064D/t51.2885-15/e15/s480x480/43913877_2130106893692252_5245480330715053223_n.jpg)">
soup.find_all('a', attrs={'class':'block'})
Is there any other way using BeautifulSoup to get what is contained in the href?
Thanks!
Just use ['attribute_name']; this gets an attribute by its name.
soup.find_all('a', attrs={'class':'block'})[0]['href']
>>> '/post/BpkL7ColOVj'
You can also use a CSS selector, which I think is more straightforward:
soup.select('a.block')[0]['href'] # same thing.

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the BeautifulSoup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source:"
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full HTML of the last element I've isolated, whereas I'm trying to get just the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link, which is the href HTML attribute. source_url is a bs4.element.Tag, which has a get method:
source_url.get('href')
You almost got it!
SOLUTION 1:
Read the .text attribute of the tag you've assigned to source_url. This works here only because the link's visible text happens to be the URL itself.
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
Call source_url.get('href') to get the value of the href attribute of the tag returned by findAll.
print(source_url.get('href'))
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

Can Beautiful Soup parse hidden attributes?

So I used Beautiful Soup in Python to parse a page that displays all my Facebook friends. Here's my code:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.facebook.com/xxx.xxx/friendspnref=lhc")
soup = BeautifulSoup(r.content, "html.parser")
for link in soup.find_all("a"):
    print(link.get('href'))
The thing is, it displays a lot of links, but none of them are links to my friends' profiles, which are displayed normally on the webpage.
On doing Inspect Element I found this:
<div class="hidden_elem"><code id="u_0_2m"><!--
The code continues, and the links to their profiles are commented out within an li tag in the div tag.
Two questions mainly:
(1) What does this mean, and why can't Beautiful Soup read them?
(2) Is there a way to read them?
I don't really plan to achieve anything by this, just curious.
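As a rough illustration of question (2): markup inside <!-- ... --> is a comment node, so a search for a tags never sees it. One possible approach is to collect the comment text first and then parse that text as HTML in a second pass. The sketch below uses the standard library's html.parser rather than Beautiful Soup (an assumption, so that it runs without extra packages). The class names, the function links_inside_comments, and the sample markup with its example.com URL are all hypothetical.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collects the raw text of every <!-- ... --> comment."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        self.comments.append(data)


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def links_inside_comments(html_text):
    """Return hrefs of links that appear only inside HTML comments."""
    comments = CommentCollector()
    comments.feed(html_text)
    comments.close()

    hrefs = []
    for comment in comments.comments:
        # Second pass: treat each comment body as its own HTML fragment.
        links = LinkCollector()
        links.feed(comment)
        links.close()
        hrefs.extend(links.hrefs)
    return hrefs


# Hypothetical markup shaped like the snippet in the question.
sample = (
    '<div class="hidden_elem"><code id="u_0_2m"><!-- '
    '<ul><li><a href="https://example.com/profile/1">Friend</a></li></ul>'
    ' --></code></div>'
)
print(links_inside_comments(sample))
```

This only helps if the commented markup is present in the response body that requests receives. Content added later by JavaScript would not be there.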

problems scraping web page using python

Hi, I'm quite new to Python and my boss has asked me to scrape this data. It is not my strong point, so I was wondering how I would go about this.
The text that I'm after (inside the quote marks) also changes every few minutes, so I'm not sure how to locate it.
I am using Beautiful Soup and lxml at the moment; however, if there are better alternatives I'm happy to try them.
This is the inspected element of the webpage:
<div class = "sometext">
<h3> somemoretext </h3>
<p>
<span class = "title" title="text i want">text i want</span>
<br>
</p>
I have tried using:
from lxml import html
import requests
from bs4 import BeautifulSoup
page = requests.get('the url')
soup = BeautifulSoup(page.text)
r = soup.findAll('//span[@class="title"]/text()')
print r
Thank you in advance, any help would be appreciated!
First, do this to see what is actually in the soup:
soup = BeautifulSoup(page.text, "html.parser")
print(soup)
That way you can double-check that you are actually dealing with what you think you are dealing with.
Then do this:
r = soup.findAll('span', attrs={"class":"title"})
for span in r:
    print(span.text)
This will get all the span tags with class="title", and .text will give the text between the tags.
Edited to Add
Note that esecules' answer will get you the title within the tag (<span class = "title" title="text i want">) whereas mine will get the title from the text (<span class = "title" >text i want</span>)
Perhaps find is the method you really need, since you're only ever looking for one element (see the docs).
r = soup.find('div', 'sometext').find('span','title')['title']
If you're familiar with XPath and you don't need features that are specific to BeautifulSoup, then using lxml alone is enough (or maybe even better, since lxml is known to be faster):
from lxml import html
import requests
page = requests.get('the url')
root = html.fromstring(page.text)
r = root.xpath('//span[@class="title"]/text()')
print(r)
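The answers above read the value from two different places: the title attribute of the span, and the text between its tags. A minimal offline sketch of that distinction follows, using the standard library's xml.etree.ElementTree with the same [@class="title"] style of predicate. ElementTree is a stand-in for lxml here (an assumption, to avoid a third-party dependency), and the snippet from the question has been given closing tags so it parses as XML.

```python
import xml.etree.ElementTree as ET

# The question's snippet, made well-formed (closing tags added).
SNIPPET = """
<div class="sometext">
  <h3>somemoretext</h3>
  <p>
    <span class="title" title="text i want">text i want</span>
    <br/>
  </p>
</div>
"""

root = ET.fromstring(SNIPPET)

# [@class='title'] keeps only spans whose class attribute equals "title".
span = root.find(".//span[@class='title']")

from_attribute = span.get("title")  # value of the title="..." attribute
from_text = span.text               # text between <span> and </span>

print(from_attribute)
print(from_text)
```

In this snippet both happen to hold the same string. On a real page they can differ, so it is worth checking which one carries the value that changes every few minutes.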

Get href using css selector with Scrapy

I want to get the href value:
<span class="title">
</span>
I tried this:
Link = Link1.css('span[class=title] a::text').extract()[0]
But I just get the text inside the <a>. How can I get the link inside the href?
What you're looking for is:
Link = Link1.css('span[class=title] a::attr(href)').extract()[0]
Since you're matching a span "class" attribute also, you can even write
Link = Link1.css('span.title a::attr(href)').extract()[0]
Please note that ::text pseudo element and ::attr(attributename) functional pseudo element are NOT standard CSS3 selectors. They're extensions to CSS selectors in Scrapy 0.20.
Edit (2017-07-20): starting from Scrapy 1.0, you can use .extract_first() instead of .extract()[0]
Link = Link1.css('span[class=title] a::attr(href)').extract_first()
Link = Link1.css('span.title a::attr(href)').extract_first()
Link = Link1.css('span.title a::attr(href)').extract_first()
You can get more information from this.
This will do the job:
Link = Link1.css('span.title a::attr(href)').extract()
Link will be a list containing the value: https://www.example.com
