Extract link and text if certain strings are found - BeautifulSoup - python

I'm trying to run BeautifulSoup to extract links and text from a website (I have permission).
I run the following code to get the links and the text:
import requests
from bs4 import BeautifulSoup

url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href"):
        print "<a href='%s'>%s</a>" % (link.get("href"), link.text)
Which gives me the following output:
View Position
</a>
<a href='/career/business-analyst-within-human-capital-management/'>
Business analyst within human capital management
COPENHAGEN • We are looking for an ambitious student with an interest in HR
who is passionate about working in the cross-field of people management,
business and technology
View Position
</a>
<a href='/career/management-consultants-within-strategic-workforce-planning/'>
Management consultants within strategic workforce planning
COPENHAGEN • We are looking for consultants with profound experience from
other consultancies
View Position
</a>
<a href='/career/management-consultants-within-supply-chain-strategy-production-and-process-management/'>
Management consultants within supply chain strategy, production and process
management
MALMÖ • We are looking for talented graduates who want a career in management
consulting
This is almost correct; however, I ONLY want positions to be returned if they have COPENHAGEN in their text (i.e. the MALMÖ position above should not have been returned).
The HTML Code for the site looks like this:
<div class="small-12 medium-9 columns top-lined">
<a href="/career/management-consultants-within-supply-chain-management/" class="box-link">
<h2 class="article__title--tiny" data-searchable-text="">Management consultants within supply chain management</h2>
<p class="article__longDescription" data-searchable-text="">COPENHAGEN • We are looking for bright graduates with a passion for supply chain management and supply chain planning for our planning and execution excellence team.</p>
<div class="styled-link styled-icon">
<span class="icon icon-icon">
<i class="fa fa-chevron-right"></i>
</span>
<span class="icon-text">View Position</span>
</div>
</a>
</div>

It seems you can just add another condition:
(...)
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print "<a href='%s'>%s</a>" % (link.get("href"), link.text)

Related

I want to scrape anchors (a) from a container class with Scrapy

<div class="breadcrumbs">
<div class="container">
Home
<span class="divider"> </span>
Special Occasion Dresses
<span class="divider"> </span>
Evening Dresses
<span class="divider"> </span>
Formal Evening Dresses
<span class="divider"> </span>
<strong>Deep V-neck Yellow Long Prom Dress Sleeveless Satin Evening Dress</strong>
</div>
I want to scrape the third anchor from the container class, but I am unable to scrape it. I used the selector response.css('.breadcrumbs div.container a').getall() to scrape all anchors, but I only get the first one. I am a beginner and need help scraping all of these anchors.
Pretty simple using XPath expressions.
If you want to get anchor by position:
third_url = response.xpath('//div[@class="container"]/a[3]/@href').get()
If you want to get anchor by the text of the link:
evening_dresses_url = response.xpath('//div[@class="container"]/a[.="Evening Dresses"]/@href').get()
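In context, these expressions could sit inside a spider callback like the sketch below (the spider name and start URL are placeholders, not from the question):

import scrapy

class BreadcrumbSpider(scrapy.Spider):
    name = "breadcrumbs"
    start_urls = ["https://example.com/some-product-page"]  # placeholder URL

    def parse(self, response):
        # every breadcrumb anchor, in document order
        all_urls = response.xpath('//div[@class="container"]/a/@href').getall()
        # the third anchor, by position
        third_url = response.xpath('//div[@class="container"]/a[3]/@href').get()
        yield {"all_urls": all_urls, "third_url": third_url}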

Get text on individual links using xpath and regex

I am working on a scrapy project and we are scraping a news website.
There is a div that contains the site's tags, and it may have several links.
For example:
<div class="article__tags">
<a href="/example/ops.html">
OPS
</a>
<a href="/example/covid-19.html">
Covid-19
</a>
<a href="/example/usa.html">
USA
</a>
</div>
and I am trying to get the individual tags.
I am doing it like this:
tags = html.xpath(
    '//div[@class="article__tags"]/a/text()').re('(\w+)')
And in the above example I get the following tags:
OPS
USA
COVID
19
which is incorrect, since Covid and 19 are parts of the same tag.
How can I get the link texts correctly?
Thank you
I managed to do it by changing it to
tags = html.xpath(
    '//div[@class="article__tags"]/a/text()').extract()
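Note that with the markup above each text() node includes the surrounding newlines and spaces, so a common follow-up (a small addition, not part of the original answer) is to strip each extracted string:

tags = [t.strip() for t in html.xpath(
    '//div[@class="article__tags"]/a/text()').extract()]
# for the example above this gives ['OPS', 'Covid-19', 'USA']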

Python: fetch title and PDF link from a URL

I'm trying to fetch the book title and the book's embedded URL link from a page (the URL appears in the code below). The HTML source of the page looks like this; I have taken just a small portion of it to illustrate:
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>Learning Deep Architectures for AI</h2>
<span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
<div class="meta-auth-ttl"></div>
<p>Foundations and Trends(r) in Machine Learning.</p>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row" isbn-data="1496034023">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<span class="meta-auth"><b>Roberto Battiti & Mauro Brunato, 2013</b></span>
<div class="meta-auth-ttl"></div>
<p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
<div>
<a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
I have tried the code below. It fetches the book names/titles, but the surrounding <h2> tags are still printed. I would like to print the book name and the book's PDF link as well.
#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq
web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = bs(web_res, 'html.parser')
headers = soup.find_all(['h2'])
print(*headers, sep='\n')
#divs = soup.find_all('div')
#print(*divs, sep="\n\n")
header_1 = soup.find_all('h2', class_='book-container')
print(header_1)
output:
<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>
Desired Output:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Please help me understand how to achieve this. I have googled around but, due to lack of knowledge, I'm unable to get it; when I look at the HTML source there are a lot of divs and classes, so I'm a little confused about which class to use to fetch the href and the h2.
The HTML is very nicely structured and you can make use of that here. The site evidently uses Bootstrap as a style scaffolding (with row and col-[size]-[gridcount] classes) that you can mostly ignore.
You essentially have:
a <div class="book"> per book
a column with
<div class="book-cats"> category and
image
a second column with
<div class="star-ratings"> ratings block
<h2> book title
<span class="meta-auth"> author line
<p> book description
two links with <a class="btn" ...>
Most of those can be ignored. Both the title and your desired link are the first element of their type, so you could just use element.nested_element to grab either.
So all you have to do is
loop over all the book divs.
for every such div, take the h2 and first a elements.
For the title take the contained text of the h2
For the link take the href attribute of the a anchor link.
like this:
for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)
I used .select_one() with a CSS selector to be a bit more specific about what link element to accept; .btn specifies the class and [href] requires that an href attribute be present.
I also enhanced the book search by limiting it to divs that have both a title and at least 1 link; the :has(...) selector limits matches to those with specific child elements.
The above produces:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
You can get the main idea from this code:
for items in zip(soup.find_all(['h2']), soup.find_all('a', class_="btn")):
    h2, href = items[0].text, items[1].get('href')
    print('Title:', h2)
    print('Link:', href)
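A self-contained variant of that idea could look like the sketch below. In the HTML shown above each book row contains two class="btn" anchors ("View Free Book" and "See Reviews"), so the sketch filters to the "View Free Book" anchors to keep titles and links paired; it also assumes, as the output above suggests, that every <h2> on the page is a book title.

#!/usr/bin/python3
from urllib.request import urlopen
from bs4 import BeautifulSoup

web_res = urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = BeautifulSoup(web_res, "html.parser")

# keep only the "View Free Book" buttons so each title pairs with its PDF link
free_links = soup.find_all("a", class_="btn", string="View Free Book")
for h2, link in zip(soup.find_all("h2"), free_links):
    print("Title:", h2.get_text(strip=True))
    print("Link:", link.get("href"))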

Scrapy XPath - Can't get text within span

I'm trying to reach the address information on a site. Here's an example of my code:
companytype_list = sel.xpath('''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath('''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath('''.//li[@class="company-size"]/p/text()''').extract()
And here's an example of how addresses are formatted on the site:
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
But when I run the scrapy script I get an IndexError: list index out of range for the address (vcard hq). I've tried to rewrite the code to get the data but it does not work. The rest of the spider works fine. Am I missing something?
Your example works fine, but I guess your XPath expressions fail on another page or HTML fragment.
The problem is the use of an index (span[3]) in the headquarters_list XPath expression. Using indexes, you depend heavily on:
1. The total number of the span elements
2. On the exact order of the span elements
In general, the use of indexes tends to make XPath expressions more fragile and more likely to fail. Thus, if possible, I would always avoid indexes. In your example you are actually taking the locality part of the address info. The span element can easily be referenced by its class name instead, which makes your expression much more robust:
//li[@class="vcard hq"]/p/span[@class='locality']/text()
Here is my testing code according to your problem description:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
html_text = """
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
"""
sel = Selector(text=html_text)
companytype_list = sel.xpath(
    '''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath(
    '''.//li[@class="company-size"]/p/text()''').extract()
It doesn't raise any exception. So chances are there exist web pages with a different structure causing errors.
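For completeness, the class-based expression suggested above can be checked against the same test selector (a small sketch reusing the sel defined in the test code):

locality_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[@class="locality"]/text()''').extract()
print(locality_list)  # ['Stockholm,']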
It's good practice not to use indexes directly in XPath rules. dron22's answer gives an awesome explanation.

Parsing HTML with BeautifulSoup

(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)
I'm trying to extract the text of the review at the bottom. I've tried this:
y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text
The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?
Edit: The HTML is:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
The div tag above the text is as follows:
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that the .strings method is what you want - it returns an iterator of each string within the object. So if you turn that iterator into a list and take the last item, you should get what you want. For example:
$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text)
>>> soup.find_all("div", style="mine")[0].text
u'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
u'wanted'
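Applied to the markup from the question, that approach might look like this (a sketch; it assumes the review text is the last string inside the outer div, as it is in the snippet shown):

review_div = soup.find_all("div", style="margin-left:0.5em;")[0]
review = list(review_div.strings)[-1].strip()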
To get the text in the tail of div.tiny:
review = soup.find("div", "tiny").findNextSibling(text=True)
Full example:
#!/usr/bin/env python
from bs4 import BeautifulSoup
html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
soup = BeautifulSoup(html)
review = soup.find("div", "tiny").findNextSibling(text=True)
print(review)
Output
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
Here's an equivalent lxml code that produces the same output:
import lxml.html
doc = lxml.html.fromstring(html)
print doc.find(".//div[@class='tiny']").tail
