Get text on individual links using xpath and regex


I am working on a Scrapy project in which we are scraping a news website.
There is a div that contains the site's tags, and it may hold several links.
For example:
<div class="article__tags">
<a href="/example/ops.html">
OPS
</a>
<a href="/example/covid-19.html">
Covid-19
</a>
<a href="/example/usa.html">
USA
</a>
</div>
and I am trying to get the individual tags.
I am doing it like this:
tags = html.xpath(
    '//div[@class="article__tags"]/a/text()').re(r'(\w+)')
And in the above example I get the following tags:
OPS
USA
COVID
19
which is incorrect, since Covid and 19 belong to the same tag.
How can I get the link texts correctly?
Thank you

I managed to do it by changing it to
tags = html.xpath(
    '//div[@class="article__tags"]/a/text()').extract()
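The reason the regex split the tag is that \w matches only letters, digits and underscores, so the hyphen in Covid-19 ends one match and starts the next. Below is a minimal sketch using only the standard re module. The raw_texts list is a hypothetical stand-in for what .extract() might return for the sample HTML, including the newlines around each tag name.

```python
import re

# Hypothetical stand-in for the list returned by .extract() on the sample HTML;
# each text node keeps the newlines that surround the tag name.
raw_texts = ["\nOPS\n", "\nCovid-19\n", "\nUSA\n"]

# \w+ cannot cross the hyphen, so "Covid-19" is split into two matches.
split_tags = [m for text in raw_texts for m in re.findall(r"\w+", text)]

# Stripping whitespace instead keeps each link text whole.
tags = [text.strip() for text in raw_texts if text.strip()]

print(split_tags)
print(tags)
```

Stripping is probably the safer choice here, because tag names may contain spaces or punctuation that any narrow character class would split again.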

Related

Python fetch title and PDF link from a URL

I'm trying to fetch each book's title and its embedded URL from a web page. I have included only a small portion of the page's HTML source to illustrate the structure.
The page URL appears in the code below. The portion of the HTML source is as follows:
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>Learning Deep Architectures for AI</h2>
<span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
<div class="meta-auth-ttl"></div>
<p>Foundations and Trends(r) in Machine Learning.</p>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row" isbn-data="1496034023">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<span class="meta-auth"><b>Roberto Battiti & Mauro Brunato, 2013</b></span>
<div class="meta-auth-ttl"></div>
<p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
<div>
<a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
I have tried the code below.
It fetches only the book titles, and the output still includes the <h2> tags. I would like to print each book's title together with its PDF link.
#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq
web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = bs(web_res, 'html.parser')
headers = soup.find_all(['h2'])
print(*headers, sep='\n')
#divs = soup.find_all('div')
#print(*divs, sep="\n\n")
header_1 = soup.find_all('h2', class_='book-container')
print(header_1)
output:
<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>
Desired Output:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Please help me understand how to achieve this. I have searched around, but I lack the knowledge to work it out. The HTML source has many div elements and classes, so I am unsure which class to use to fetch the href and the h2.

The HTML is very nicely structured and you can make use of that here. The site evidently uses Bootstrap as a style scaffolding (with row and col-[size]-[gridcount] classes), which you can mostly ignore.
You essentially have:
a <div class="book"> per book
    a column with
        <div class="book-cats"> category and
        image
    a second column with
        <div class="star-ratings"> ratings block
        <h2> book title
        <span class="meta-auth"> author line
        <p> book description
        two links with <a class="btn" ...>
Most of those can be ignored. Both the title and your desired link are the first element of their type, so you could just use element.nested_element to grab either.
So all you have to do is
loop over all the book divs.
for every such div, take the h2 and first a elements.
For the title take the contained text of the h2
For the link take the href attribute of the a anchor link.
like this:
for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)
I used .select_one() with a CSS selector to be a bit more specific about what link element to accept; .btn specifies the class and [href] that a href attribute must be present.
I also enhanced the book search by limiting it to divs that have both a title and at least 1 link; the :has(...) selector limits matches to those with specific child elements.
The above produces:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...

You can get the main idea from this code:
for items in zip(soup.find_all(['h2']), soup.find_all('a', class_="btn")):
    h2, href = items[0].text, items[1].get('href')
    print('Title:', h2)
    print('Link:', href)
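One caveat with pairing by position: in the HTML shown in the question, each book has two a.btn links (View Free Book and See Reviews), so zipping every h2 with every a.btn may drift out of step. Here is a small sketch with plain lists standing in for the two find_all() results. The values come from the sample HTML; the list names are made up for illustration.

```python
# Plain lists standing in for the h2 texts and the a.btn hrefs, in document order.
titles = [
    "Learning Deep Architectures for AI",
    "The LION Way: Machine Learning plus Intelligent Optimization",
]
btn_hrefs = [
    "http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf",  # View Free Book
    "http://amzn.to/1WePh0N",                                     # See Reviews
    "http://www.e-booksdirectory.com/details.php?ebook=9575",     # View Free Book
    "http://amzn.to/1FcalRp",                                     # See Reviews
]

# Naive positional pairing: the second title gets the first book's review link.
naive = list(zip(titles, btn_hrefs))

# Taking every second link realigns the pairs, but only under the assumption
# that every book has exactly two buttons.
aligned = list(zip(titles, btn_hrefs[::2]))

for title, href in aligned:
    print("Title:", title)
    print("Link:", href)
```

Looping per book container, as in the other answer, avoids relying on that assumption.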

Extract link and text if certain strings are found - BeautifulSoup

I'm trying to use BeautifulSoup to extract links and text from a website (I have permission).
I run the following code to get the links and the text:
import requests
from bs4 import BeautifulSoup

url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href"):
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
Which gives me the following output:
View Position
</a>
<a href='/career/business-analyst-within-human-capital-management/'>
Business analyst within human capital management
COPENHAGEN • We are looking for an ambitious student with an interest in HR
who is passionate about working in the cross-field of people management,
business and technology
View Position
</a>
<a href='/career/management-consultants-within-strategic-workforce-planning/'>
Management consultants within strategic workforce planning
COPENHAGEN • We are looking for consultants with profound experience from
other consultancies
View Position
</a>
<a href='/career/management-consultants-within-supply-chain-strategy-
production-and-process-management/'>
Management consultants within supply chain strategy, production and process
management
MALMÖ • We are looking for talented graduates who want a career in management
consulting
Which is almost correct. However, I ONLY want the positions to be returned if they have the name COPENHAGEN in the text (i.e. the MALMÖ position above should not have been returned).
The HTML Code for the site looks like this:
<div class="small-12 medium-9 columns top-lined">
<a href="/career/management-consultants-within-supply-chain-management/" class="box-link">
<h2 class="article__title--tiny" data-searchable-text="">Management consultants within supply chain management</h2>
<p class="article__longDescription" data-searchable-text="">COPENHAGEN • We are looking for bright graduates with a passion for supply chain management and supply chain planning for our planning and execution excellence team.</p>
<div class="styled-link styled-icon">
<span class="icon icon-icon">
<i class="fa fa-chevron-right"></i>
</span>
<span class="icon-text">View Position</span>
</div>
</a>
</div>

It seems you can just add another condition:
(...)
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))

Python scraping xpath get <a> with specific <span>

I'm using Scrapy to get some data from a website.
I have the following list of links:
<li class="m-pagination__item">
10
</li>
<li class="m-pagination__item">
<a href="?isin=IT0000072618&lang=it&page=1">
<span class="m-icon -pagination-right"></span>
</a>
I want to extract the href attribute only of the 'a' element that contains the span class="m-icon -pagination-right".
I've been looking for some XPath examples, but I'm not an XPath expert and I couldn't find a solution.
Thanks.

//a[span/@class = 'm-icon -pagination-right']/@href

With a Scrapy response:
response.css('span.m-icon').xpath('../@href')
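To try the "parent of the matching span" idea without Scrapy, here is a rough sketch with the standard library's xml.etree.ElementTree. It is a stand-in, not the Scrapy API: ElementTree supports only a subset of XPath and has no attribute step, so the href is read with .get(). The wrapping ul, the closing li tag and the escaped ampersands are additions that make the fragment well-formed XML.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, wrapped and escaped so it parses as XML.
snippet = """<ul>
<li class="m-pagination__item">10</li>
<li class="m-pagination__item">
<a href="?isin=IT0000072618&amp;lang=it&amp;page=1">
<span class="m-icon -pagination-right"></span>
</a>
</li>
</ul>"""

root = ET.fromstring(snippet)

# Find the span by its exact class attribute, then step up to its parent.
parents = root.findall(".//span[@class='m-icon -pagination-right']/..")

# Read the href from each parent that is an <a> element.
hrefs = [el.get("href") for el in parents if el.tag == "a"]
print(hrefs)
```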

Using XPath Following to get element from XML

I have an XML like the following
<li class="expandSubItem">
<span class="expandSubLink">Popular Neighborhoods</span>
<ul class="secondSubNav" style="top:-0.125em;">
<li class="subItem">
<a class="subLink" href="/Hotels-g187147-zfn7236765-Paris_Ile_de_France-Hotels.html">Quartier Latin Hotels</a>
</li>
</ul>
</li>
<li class="expandSubItem">
<span class="expandSubLink">Popular Paris Categories</span>
<ul class="secondSubNav" style="top:-0.125em;">
<li class="subItem">
<a class="subLink" href="/HotelsList-Paris-Cheap-Hotels-zfp10420.html">Paris Cheap Hotels</a>
</li>
</ul>
</li>
I want to get all links under "Popular Paris Categories". I used something like this //li//a/@href/following::span[text()='Popular Singapore Categories'], but it gave no results. Any idea how to get the correct result? Here is the snippet of the Python code that I wrote.
import requests
from lxml import html

t_url = 'https://www.tripadvisor.com/Tourism-g187147-Paris_Ile_de_France-Vacations.html'
page = requests.get(t_url, timeout=30)
tree = html.fromstring(page.content)
links = tree.xpath('//li[span="Popular Paris Categories"]//a/@href')
print(links)

This is one possible way:
//li[normalize-space(span)="Popular Paris Categories"]//a/@href
Notice how normalize-space() is used to strip leading and trailing whitespace from the span content. This is the reason why the XPath I suggested initially in the comment didn't work for your actual HTML.

Something like this perhaps
//span[text()='Popular Paris Categories']/following-sibling::ul//a/@href
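To see why an exact text comparison can fail where a whitespace-normalized one succeeds, here is a sketch that mimics normalize-space() in plain Python with the standard library's xml.etree.ElementTree (a stand-in, since ElementTree has no XPath functions). The trailing space inside the second span and the wrapping ul are assumptions added for illustration, and normalize_space is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Sample from the question; the trailing space in the second span is assumed.
snippet = """<ul>
<li class="expandSubItem">
<span class="expandSubLink">Popular Neighborhoods</span>
<ul class="secondSubNav">
<li class="subItem">
<a class="subLink" href="/Hotels-g187147-zfn7236765-Paris_Ile_de_France-Hotels.html">Quartier Latin Hotels</a>
</li>
</ul>
</li>
<li class="expandSubItem">
<span class="expandSubLink">Popular Paris Categories </span>
<ul class="secondSubNav">
<li class="subItem">
<a class="subLink" href="/HotelsList-Paris-Cheap-Hotels-zfp10420.html">Paris Cheap Hotels</a>
</li>
</ul>
</li>
</ul>"""


def normalize_space(text):
    """Mimic XPath normalize-space(): trim and collapse runs of whitespace."""
    return " ".join((text or "").split())


root = ET.fromstring(snippet)

# Exact comparison: the stray space means no li matches.
exact = root.findall(".//li[span='Popular Paris Categories']")

# Normalized comparison: check each li's direct span child after cleanup.
links = []
for li in root.iter("li"):
    span = li.find("span")
    if span is not None and normalize_space("".join(span.itertext())) == "Popular Paris Categories":
        links.extend(a.get("href") for a in li.iter("a"))

print(exact)
print(links)
```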

Parsing HTML with BeautifulSoup

(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)
I'm trying to extract the text of the review at the bottom. I've tried this:
y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text
The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?
Edit: The HTML is:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
The div tag above the text is as follows:
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.

The BeautifulSoup documentation (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings) suggests that the .strings generator is what you want: it yields each string within the object. So if you turn it into a list and take the last item, you should get what you want. For example:
$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text)
>>> soup.find_all("div", style="mine")[0].text
u'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
u'wanted'

To get the text in the tail of div.tiny:
review = soup.find("div", "tiny").findNextSibling(text=True)
Full example:
#!/usr/bin/env python
from bs4 import BeautifulSoup
html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
soup = BeautifulSoup(html)
review = soup.find("div", "tiny").findNextSibling(text=True)
print(review)
Output
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
Here's an equivalent lxml code that produces the same output:
import lxml.html
doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
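The .tail attribute comes from the ElementTree data model, where text that follows an element's closing tag is stored on that element. Here is a minimal sketch with the standard library's xml.etree.ElementTree, which shares that model, run on a cut-down, made-up fragment rather than the real page.

```python
import xml.etree.ElementTree as ET

# Made-up fragment: an inner div followed by the text we actually want.
fragment = '<div style="mine"><div class="tiny">unwanted</div>wanted</div>'
root = ET.fromstring(fragment)

# itertext() walks every string in document order.
all_text = list(root.itertext())

# The text after the inner div's closing tag is stored as that div's tail.
review = root.find("div[@class='tiny']").tail

print(all_text)
print(review)
```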
