How to extract href url from html anchor using lxml? - python

I'm trying to extract the next-page href string using lxml.
For example, I'd like to extract "/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" from the HTML in the following example:
<nav rel="nav" class="pagination-container AjaxPager">
<a href="/review/bulb.co.uk?b=MTYxOTg5MDE1OTAwMHw2MDhkOGZlZmY5ZjQ4NzA4ZTA4MWI2Mzk" data-page-number="next-page" class="button button--primary next-page" rel="next" data-track-link="{'target': 'Company profile', 'name': 'navigation', 'navigationType': 'next'}">
Next page
</a>
</nav>
I have tried the following, but it returns a list rather than the string I am looking for:
import requests
import lxml.html as html

URL = 'https://uk.trustpilot.com/review/bulb.co.uk'
page = requests.get(URL)
tree = html.fromstring(page.content)
href = tree.xpath('//a/@href')
Any idea what I am doing wrong?

Making this change to your code:
href = tree.xpath('//a[@class="button button--primary next-page"]/@href')
href[0]
Gives me this output:
'/review/bulb.co.uk?b=MTYxOTk1ODMxMzAwMHw2MDhlOWEyOWY5ZjQ4NzA4ZTA4MjMxNTE'
which is close to the output in your question (its value may change dynamically).
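For reference, a minimal end-to-end sketch putting the fix together (the URL and class name are taken from the question; only the indexing guard is added):
import requests
import lxml.html as html

URL = 'https://uk.trustpilot.com/review/bulb.co.uk'
page = requests.get(URL)
tree = html.fromstring(page.content)

# xpath() always returns a list, so index into it; guard against
# pages that have no next-page link at all:
hrefs = tree.xpath('//a[@class="button button--primary next-page"]/@href')
next_href = hrefs[0] if hrefs else None
print(next_href)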

Related

How to scrape aria-label text in python?

I want to scrape the list of player names from a website, but the names are in aria-label attributes, and I don't know how to scrape the text from those labels.
Here is the link
https://athletics.baruch.cuny.edu/sports/mens-swimming-and-diving/roster
For example, the HTML contains the following. How do I scrape the text from these labels?
<div class="sidearm-roster-player-image column">
<a data-bind="click: function() { return true; }, clickBubble: false" href="/sports/mens-swimming-and-diving/roster/gregory-becker/3555" aria-label="Gregory Becker - View Full Bio" title="View Full Bio">
<img class="lazyload" data-src="/images/2018/10/19/GREGORY_BECKER.jpg?width=80" alt="GREGORY BECKER">
</a>
</div>
You can use the .get() method in BeautifulSoup. First select your element into elem (or any other variable) using any selector or find/find_all, then try:
print(elem.get('aria-label'))
Below is code that will help you extract the name from each a tag:
from bs4 import BeautifulSoup

with open("<path-to-html-file>") as fp:
    soup = BeautifulSoup(fp, 'html.parser')  # parse the html

tags = soup.find_all('a')  # get all the a tags
for tag in tags:
    print(tag.get('aria-label'))  # get the required text
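On a full page, not every <a> tag carries an aria-label, and .get() returns None for those. A short follow-up sketch that keeps only the labelled links, assuming the labels always follow the "Gregory Becker - View Full Bio" format seen in the snippet:
# Reusing the soup from above; attrs={'aria-label': True} matches only
# tags that actually have the attribute:
labelled = soup.find_all('a', attrs={'aria-label': True})
names = [tag.get('aria-label').split(' - ')[0] for tag in labelled]  # drop the " - View Full Bio" suffix
print(names)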

Scrapy: how to get links to users?

I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for href on elements of class text--ellipsisOneLine. In your HTML snippet, that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute. Now, if you want the text inside this h4 element, you need to use the ::text pseudo-element:
response.css('.text--ellipsisOneLine::text').getall()
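To get the href itself rather than the text, select the parent <a> element instead of the <h4>. Here is a sketch using parsel (the selector library Scrapy uses internally) against the snippet from the question; the same expression works on a Scrapy response:
from parsel import Selector

html = '''
<div class="flex flex--row flex--noGutters flex--alignCenter">
  <div class="flex-item _memberItem-module_name__BSx8i">
    <a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
      <h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
    </a>
  </div>
</div>
'''

sel = Selector(text=html)
# Walk up from the h4 to its parent <a> and read the href attribute:
print(sel.xpath('//h4[contains(@class, "text--ellipsisOneLine")]/parent::a/@href').get())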
I realize that this isn't scrapy, but personally for web scraping I use the requests module and BeautifulSoup4, and the following code snippet will get you a list of users with the aforementioned modules:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.findAll('h4')
    for user in users:
        print(user.text)
Or, with a Scrapy CSS selector that targets the <a> element itself:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()

Python / Beautifulsoup: HTML Path to the current element

For a class project, I'm working on extracting all links on a webpage. This is what I have so far.
from bs4 import BeautifulSoup, SoupStrainer

with open("input.htm") as inputFile:
    soup = BeautifulSoup(inputFile)

outputFile = open('output.txt', 'w')
for link in soup.find_all('a', href=True):
    outputFile.write(str(link) + '\n')
outputFile.close()
This works very well.
Here's the complication: for every <a> element, my project requires me to know the entire "tree structure" leading to the current link. In other words, I'd like to know all the ancestor elements, starting with the <body> element, along with the class and id of each along the way.
Like the navigation pane in Windows Explorer, or the navigation panel in many browsers' element-inspection tools.
For example, if you look at the Bible page on Wikipedia and a link to the Wikipedia page for the Talmud, the following "path" is what I'm looking for.
<body class="mediawiki ...>
<div id="content" class="mw-body" role="main">
<div id="bodyContent" class="mw-body-content">
<div id="mw-content-text" ...>
<div class="mw-parser-output">
<div role="navigation" ...>
<table class="nowraplinks ...>
<tbody>
<td class="navbox-list ...>
<div style="padding:0em 0.25em">
<ul>
<li>
<a href="/wiki/Talmud"
Thanks a bunch.
-Maureen
Try this code:
soup = BeautifulSoup(inputFile, 'html.parser')
Or use lxml:
soup = BeautifulSoup(inputFile, 'lxml')
If it is not installed:
pip install lxml
Here is a solution I just wrote. It works by finding the element, then navigating up the tree via the element's parents. I parse just the opening tag of each ancestor and add it to a list, then reverse the list at the end. We end up with a list that resembles the tree you requested.
I have written it for one element; you can modify it to work with your find_all.
from bs4 import BeautifulSoup
import requests

page = requests.get("https://en.wikipedia.org/wiki/Bible")
soup = BeautifulSoup(page.text, 'html.parser')

tree = []
hrefElement = soup.find('a', href=True)
hrefString = str(hrefElement).split(">")[0] + ">"  # keep just the opening tag
tree.append(hrefString)

hrefParent = hrefElement.find_parent()
while hrefParent.name != "html":
    hrefString = str(hrefParent).split(">")[0] + ">"
    tree.append(hrefString)
    hrefParent = hrefParent.find_parent()

tree.reverse()
print(tree)
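As a possible shortening of the same idea, BeautifulSoup's .parents generator does the upward walk for you. A sketch that produces the same root-first list, reusing the hrefElement variable from the code above:
# Collect the element and all its ancestors, skipping the <html> tag
# and the soup's "[document]" wrapper, then emit opening tags root-first:
path = [hrefElement] + [p for p in hrefElement.parents
                        if p.name not in ("html", "[document]")]
tree = [str(el).split(">")[0] + ">" for el in reversed(path)]
print(tree)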

Exception handling when the input link doesn't have the appropriate form

For instance, I have a list of links like this:
linklists = ['www.right1.com', 'www.right2.com', 'www.wrong.com', 'www.right3.com']
and the HTML of right1, right2, and right3 each has this form:
<html>
<p>
hi
</p>
<strong>
hello
</strong>
</html>
and the form of the www.wrong.com HTML is (the actual HTML is much more complicated):
<html>
<p>
hi
</p>
</html>
and I'm using code like this:
import re
import urllib2
from BeautifulSoup import BeautifulSoup

stronglist = []
for httplink in linklists:
    url = httplink
    page = urllib2.urlopen(url)
    html = page.read()
    soup = BeautifulSoup(html)
    findstrong = soup.findAll("strong")
    findstrong = str(findstrong)
    findstrong = re.sub(r'\[|\]|\s*<[^>]*>\s*', '', findstrong)  # remove tags
    stronglist.append(findstrong)
What I want to do is:
go through the HTML of each link in the list 'linklists'
find the data between <strong> tags
add it to the list 'stronglist'
But the problem is:
there is a wrong link (www.wrong.com) that has no <strong> tag,
so the code raises an error...
What I want is exception handling (or something else) so that if a link has no 'strong' field, the code adds the string 'null' to stronglist, since it can't get data from that link.
I have been trying to use if statements to solve this, but it's a bit hard for me.
Any suggestions?
There is no need to use exception handling. Just identify when the findAll method returns an empty list and deal with that.
import urllib2
from BeautifulSoup import BeautifulSoup

strong_list = []
for url in link_list:
    soup = BeautifulSoup(urllib2.urlopen(url).read())
    strong_tags = soup.findAll("strong")
    if not strong_tags:
        strong_list.append('null')
        continue
    for strong_tag in strong_tags:
        strong_list.append(strong_tag.text)
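That said, some links may also fail to load at all (the list in the question has no http:// scheme, which makes urlopen raise ValueError, and unreachable hosts raise URLError). If that can happen, wrapping the fetch in a try/except and treating a failed fetch like a missing tag covers both cases. A sketch in the same Python 2 / BeautifulSoup 3 style:
import urllib2
from BeautifulSoup import BeautifulSoup

strong_list = []
for url in link_list:
    try:
        html = urllib2.urlopen(url).read()
    except (urllib2.URLError, ValueError):
        strong_list.append('null')  # unreachable or malformed link
        continue
    soup = BeautifulSoup(html)
    strong_tags = soup.findAll("strong")
    if not strong_tags:
        strong_list.append('null')  # page has no <strong> tags
        continue
    for strong_tag in strong_tags:
        strong_list.append(strong_tag.text)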

Beautiful Soup - Cannot find the tags

The page is: http://item.taobao.com/item.htm?id=13015989524
You can view its source code, which contains the following:
<a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank">
But when I use BeautifulSoup to read the source code and execute the following
soup.findAll('a', href="http://item.taobao.com/item.htm?id=13015989524")
It returns an empty list: []. Why does it return []?
As far as I can see, the <a> tag you are trying to find is inside a <textarea> tag. BS does not parse the contents of <textarea> as HTML, and rightly so since <textarea> should not contain HTML. In short, that page is doing something sketchy.
If you really need to get that, you might "cheat" and parse the contents of <textarea> again and search within them:
import urllib
from BeautifulSoup import BeautifulSoup as BS

soup = BS(urllib.urlopen("http://item.taobao.com/item.htm?id=13015989524"))
a = []
for textarea in soup.findAll("textarea"):
    textsoup = BS(textarea.text)  # parse the contents as html
    a.extend(textsoup.findAll("a", attrs={"href": "http://item.taobao.com/item.htm?id=13015989524"}))

for tag in a:
    print tag
# outputs
# <a href="http://item.taobao.com/item.htm?id=13015989524" target="_blank"><img ...
# <a href="http://item.taobao.com/item.htm?id=13015989524" title="901 ...
Alternatively, pass the attribute filter as a dictionary:
soup.findAll('a', {
    'href': "http://item.taobao.com/item.htm?id=13015989524"
})
