Python fetch title and PDF link from a URL - python

I'm trying to fetch each book's title and its embedded URL from a web page. The page's HTML source looks like the excerpt below; I have included only a small portion of it for illustration.
A small portion of the source HTML is as follows:
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>Learning Deep Architectures for AI</h2>
<span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
<div class="meta-auth-ttl"></div>
<p>Foundations and Trends(r) in Machine Learning.</p>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row" isbn-data="1496034023">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<span class="meta-auth"><b>Roberto Battiti & Mauro Brunato, 2013</b></span>
<div class="meta-auth-ttl"></div>
<p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
<div>
<a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
I have tried the code below.
This code only fetches the book titles, and the output still includes the <h2> tags. I would like to print each book's name along with the book's PDF link.
#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq
web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = bs(web_res, 'html.parser')
headers = soup.find_all(['h2'])
print(*headers, sep='\n')
#divs = soup.find_all('div')
#print(*divs, sep="\n\n")
header_1 = soup.find_all('h2', class_='book-container')
print(header_1)
output:
<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>
Desired Output:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Please help me understand how to achieve this. I have googled around, but due to my lack of knowledge I'm unable to work it out. When I look at the HTML source there are a lot of divs and classes, so I'm a little confused about which class to use to fetch the href and the h2.

The HTML is very nicely structured and you can make use of that here. The site evidently uses Bootstrap as a style scaffolding (with row and col-[size]-[gridcount] classes you can mostly ignore).
You essentially have:
a <div class="book"> per book
    a column with
        <div class="book-cats"> category and
        image
    a second column with
        <div class="star-ratings"> ratings block
        <h2> book title
        <span class="meta-auth"> author line
        <p> book description
        two links with <a class="btn" ...>
Most of those can be ignored. Both the title and your desired link are the first element of their type, so you could just use element.nested_element to grab either.
So all you have to do is
loop over all the book divs.
for every such div, take the h2 and first a elements.
For the title take the contained text of the h2
For the link take the href attribute of the a anchor link.
like this:
for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    title = book.h2.get_text(strip=True)
    link = book.select_one("a.btn[href]")["href"]
    # store or process title and link
    print("Title:", title)
    print("Link:", link)
I used .select_one() with a CSS selector to be a bit more specific about what link element to accept; .btn specifies the class and [href] that a href attribute must be present.
I also enhanced the book search by limiting it to divs that have both a title and at least 1 link; the :has(...) selector limits matches to those with specific child elements.
The above produces:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
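To see what the :has(...) filter does in practice, here is a minimal, self-contained sketch that runs against a trimmed copy of the markup from the question instead of the live site. The second book div (the one without a link) is made up for illustration and is not taken from the real page. It assumes a BeautifulSoup release that bundles the soupsieve selector engine (4.7 or later).

```python
from bs4 import BeautifulSoup

# Trimmed sample markup. The second "book" div is hypothetical: it has a
# title but no link, so the :has(a.btn[href]) filter should skip it.
SAMPLE = """
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-6">
<h2>Learning Deep Architectures for AI</h2>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row">
<div class="col-lg-6">
<h2>Hypothetical book without a link</h2>
</div>
</div>
</section>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

books = []
for book in soup.select("div.book:has(h2):has(a.btn[href])"):
    books.append({
        "title": book.h2.get_text(strip=True),
        # select_one returns the first match in document order,
        # which here is the "View Free Book" link
        "link": book.select_one("a.btn[href]")["href"],
    })

for entry in books:
    print("Title:", entry["title"])
    print("Link:", entry["link"])
```

Collecting the results into a list of dicts rather than printing inside the loop is just one option; it makes the pairs easy to write to CSV or JSON later.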

You can get the main idea from this code:
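# Keep only the "View Free Book" links (link text as in the excerpt); each book
# also has a "See Reviews" a.btn link that would throw the pairing off by one.
free_links = soup.find_all('a', class_="btn", string="View Free Book")
for h2, a in zip(soup.find_all('h2'), free_links):
    print('Title:', h2.text)
    print('Link:', a.get('href'))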

Related

Anything similar to "until" in CSS selector?

I would like to get the movie names that sit between the "tracked_by" id and the "buzz_off" id. I have already created a selector which can grab the names after the "tracked_by" id. However, my intention is to let the script keep parsing only UNTIL it finds the "buzz_off" id. The elements containing the names are:
html = '''
<div class="list">
<a id="allow" name="allow"></a>
<h4 class="cluster">Allow</h4>
<div class="base min">Sally</div>
<div class="base max">Blood Diamond</div>
<a id="tracked_by" name="tracked_by"></a>
<h4 class="cluster">Tracked by</h4>
<div class="base min">Gladiator</div>
<div class="base max">Troy</div>
<a id="buzz_off" name="buzz_off"></a>
<h4 class="cluster">Buzz-off</h4>
<div class="base min">Heat</div>
<div class="base max">Matrix</div>
</div>
'''
from lxml import html as htm
root = htm.fromstring(html)
for item in root.cssselect("a#tracked_by ~ div.base a"):
    print(item.text)
The selector I've tried with (also mentioned in the above script):
a#tracked_by ~ div.base a
Results I'm having:
Gladiator
Troy
Heat
Matrix
Results I would like to get:
Gladiator
Troy
By the way, I want to use this selector to parse the names, not for styling.
This is a reference for CSS selectors. As you can see, it doesn't have any form of logic, as CSS is not a programming language. You'd have to use a loop in Python and handle each element one at a time, stopping once you reach the "buzz_off" element, or append the names to a list as you go.
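The loop described above could look roughly like the sketch below. It uses lxml's itersiblings() to walk the elements that follow the tracked_by anchor and breaks at the buzz_off anchor; the variable names are my own, and it assumes the names live directly in the div.base elements, as in the HTML from the question.

```python
from lxml import html as htm

html = '''
<div class="list">
<a id="allow" name="allow"></a>
<h4 class="cluster">Allow</h4>
<div class="base min">Sally</div>
<div class="base max">Blood Diamond</div>
<a id="tracked_by" name="tracked_by"></a>
<h4 class="cluster">Tracked by</h4>
<div class="base min">Gladiator</div>
<div class="base max">Troy</div>
<a id="buzz_off" name="buzz_off"></a>
<h4 class="cluster">Buzz-off</h4>
<div class="base min">Heat</div>
<div class="base max">Matrix</div>
</div>
'''

root = htm.fromstring(html)

# Locate the starting anchor, then walk its following siblings in order.
start = root.xpath('.//a[@id="tracked_by"]')[0]

names = []
for sibling in start.itersiblings():
    if sibling.get("id") == "buzz_off":
        break  # this is the "until" condition
    classes = (sibling.get("class") or "").split()
    if sibling.tag == "div" and "base" in classes:
        names.append(sibling.text)

print(names)  # ['Gladiator', 'Troy']
```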

Extract link and text if certain strings are found - BeautifulSoup

I'm trying to run beautifulSoup to extract links and text from a website (I have permission)
I run the following code to get the links and the text:
import requests
from bs4 import BeautifulSoup
url = "http://implementconsultinggroup.com/career/#/6257"
r = requests.get(url)
soup = BeautifulSoup(r.content)
links = soup.find_all("a")
for link in links:
    if "career" in link.get("href"):
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
Which gives me the following output:
View Position
</a>
<a href='/career/business-analyst-within-human-capital-management/'>
Business analyst within human capital management
COPENHAGEN • We are looking for an ambitious student with an interest in HR
who is passionate about working in the cross-field of people management,
business and technology
View Position
</a>
<a href='/career/management-consultants-within-strategic-workforce-planning/'>
Management consultants within strategic workforce planning
COPENHAGEN • We are looking for consultants with profound experience from
other consultancies
View Position
</a>
<a href='/career/management-consultants-within-supply-chain-strategy-
production-and-process-management/'>
Management consultants within supply chain strategy, production and process
management
MALMÖ • We are looking for talented graduates who want a career in management
consulting
Which is almost correct, however I ONLY want the positions to be returned if they have the name COPENHAGEN in the text (ie above the MALMO position should not have been returned).
The HTML Code for the site looks like this:
<div class="small-12 medium-9 columns top-lined">
<a href="/career/management-consultants-within-supply-chain-management/" class="box-link">
<h2 class="article__title--tiny" data-searchable-text="">Management consultants within supply chain management</h2>
<p class="article__longDescription" data-searchable-text="">COPENHAGEN • We are looking for bright graduates with a passion for supply chain management and supply chain planning for our planning and execution excellence team.</p>
<div class="styled-link styled-icon">
<span class="icon icon-icon">
<i class="fa fa-chevron-right"></i>
</span>
<span class="icon-text">View Position</span>
</div>
</a>
</div>
It seems you can just add another condition:
(...)
for link in links:
    if "career" in link.get("href") and 'COPENHAGEN' in link.text:
        print("<a href='%s'>%s</a>" % (link.get("href"), link.text))
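A slightly stricter variant, as a sketch: check for COPENHAGEN only in the description paragraph rather than in the whole link text, and skip anchors that have no href at all. The class name article__longDescription comes from the HTML in the question; the second (Malmö) entry in the sample is invented to show the filter working.

```python
from bs4 import BeautifulSoup

# First entry is from the question; the second one is hypothetical.
SAMPLE = """
<div class="small-12 medium-9 columns top-lined">
<a href="/career/management-consultants-within-supply-chain-management/" class="box-link">
<h2 class="article__title--tiny" data-searchable-text="">Management consultants within supply chain management</h2>
<p class="article__longDescription" data-searchable-text="">COPENHAGEN • We are looking for bright graduates.</p>
</a>
</div>
<div class="small-12 medium-9 columns top-lined">
<a href="/career/hypothetical-malmo-position/" class="box-link">
<h2 class="article__title--tiny" data-searchable-text="">Hypothetical position</h2>
<p class="article__longDescription" data-searchable-text="">MALMÖ • We are looking for talented graduates.</p>
</a>
</div>
"""

soup = BeautifulSoup(SAMPLE, "html.parser")

matches = []
for link in soup.find_all("a", href=True):  # href=True skips anchors without href
    description = link.find("p", class_="article__longDescription")
    title = link.find("h2")
    if description is None or title is None:
        continue
    if "career" in link["href"] and "COPENHAGEN" in description.get_text():
        matches.append((link["href"], title.get_text(strip=True)))

for href, title in matches:
    print("<a href='%s'>%s</a>" % (href, title))
```

Matching on the description paragraph avoids false positives if the word ever shows up in a title or elsewhere in the link.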

How to scrape tags that appear within a script

My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
from bs4 import BeautifulSoup as bs
import requests as req
import pprint as pp
url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # verify that the page has been found
all_items = soup.find_all('li', class_='top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags fitting those criteria.
Inspect Element in Chrome displays the list items as rendered <li class="top10-item"> elements.
However, in the page source the list (ul class="top10-items") is empty and is followed by a script, which seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
My question is: How can I extract the item names from the script using Beautiful soup?
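A small sketch of why find_all() comes back empty, using a trimmed copy of the source shown above. With the html.parser backend the contents of a script tag are kept as one plain string, so the li inside it is not part of the parsed tree. Re-parsing that string does find the li, but it only contains Mustache placeholders such as {{productName}}; the real product names are filled in by JavaScript in the browser and are not in this markup. Where that data comes from is not shown in the question, so this sketch does not attempt to fetch it.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page source from the question.
PAGE = """
<div class="container">
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
</li>
{{/topList}}</script>
</div>
"""

soup = BeautifulSoup(PAGE, "html.parser")

# The <li> lives inside the script text, so it is not a tag in the tree.
items = soup.find_all("li", class_="top10-item")
print(items)

# The template itself is available as a string ...
template = str(soup.find("script", class_="X-template-top10").string)

# ... and can be parsed again, but it only holds placeholders.
inner = BeautifulSoup(template, "html.parser")
placeholder = inner.find("a", class_="item-desc").get_text()
print(placeholder)
```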

IndexError: list index out of range while using bs4

This is the link where I am trying to fetch data: flipkart
and this is the relevant part of the HTML:
<div class="toolbar-wrap line section">
<div class="ratings-reviews-wrap">
<div itemprop="aggregateRating" itemscope="" itemtype="http://schema.org/AggregateRating" class="ratings-reviews line omniture-field">
<div class="ratings">
<meta itemprop="ratingValue" content="1">
<div class="fk-stars" title="1 stars">
<span class="unfilled">★★★★★</span>
<span class="rating filled" style="width:20%">
★★★★★
</span>
</div>
<div class="count">
<span itemprop="ratingCount">2</span>
</div>
</div>
</div>
</div>
</div>
Here I need to fetch "1 stars" from the title attribute (title="1 stars") and 2 from <span itemprop="ratingCount">2</span>.
I tried the following code:
x = link_soup.find_all("div",class_='fk-stars')[0].get('title')
print x, " product_star"
y = link_soup.find_all("span",itemprop="ratingCount")[0].string.strip()
print y
but it gives:
IndexError: list index out of range
The content that you see in the browser is not actually present in the raw HTML that is retrieved from this URL.
When loaded with a browser, the page executes AJAX calls to load additional content, which is then dynamically inserted into the page. One of the calls gets the ratings info that you are after. Specifically this URL is the one that contains the HTML that is inserted as the "action bar".
But if you retrieve the main page using Python, e.g. with requests, urllib, etc., the dynamic content is not loaded, and that is why BeautifulSoup can't find the tags.
You could analyse the main page to find the actual link, retrieve that, and then run it through BeautifulSoup. The link looks like it begins with /p/pv1/spotList1/spot1/actionBar, so that prefix, or perhaps just actionBar, should be sufficient to locate the actual link.
Or you could use selenium to load the page and then grab and process the rendered HTML.
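Whichever way you obtain the HTML, it helps to make the lookup tolerant of missing tags so that it returns None instead of raising IndexError. The sketch below is a minimal illustration: extract_rating is a hypothetical helper name, RENDERED is a trimmed copy of the markup from the question, and RAW stands in for a page where the ratings block has not been loaded.

```python
from bs4 import BeautifulSoup


def extract_rating(markup):
    """Return (stars_title, rating_count); None for anything not found."""
    soup = BeautifulSoup(markup, "html.parser")
    # find() returns None when nothing matches, unlike find_all(...)[0],
    # which raises IndexError on an empty result list.
    stars = soup.find("div", class_="fk-stars")
    count = soup.find("span", itemprop="ratingCount")
    title = stars.get("title") if stars is not None else None
    number = count.get_text(strip=True) if count is not None else None
    return title, number


# Markup as seen in the browser (trimmed from the question).
RENDERED = """
<div class="ratings">
<div class="fk-stars" title="1 stars"></div>
<div class="count"><span itemprop="ratingCount">2</span></div>
</div>
"""

# Hypothetical raw response without the dynamically inserted block.
RAW = '<div class="toolbar-wrap line section"></div>'

print(extract_rating(RENDERED))  # ('1 stars', '2')
print(extract_rating(RAW))       # (None, None)
```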

Scrapy XPath - Can't get text within span

I'm trying to reach the address information on a site. Here's an example of my code:
companytype_list = sel.xpath('''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath('''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath('''.//li[@class="company-size"]/p/text()''').extract()
And here's an example of how addresses are formatted on the site:
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
But when I run the scrapy script I get an IndexError: list index out of range for the address (vcard hq). I've tried to rewrite the code to get the data but it does not work. The rest of the spider works fine. Am I missing something?
Your example works fine. But I guess your xpath expressions failed on another page or html part.
The problem is the use of an index (span[3]) in the headquarters_list XPath expression. By using indexes you depend heavily on:
1. The total number of span elements
2. The exact order of the span elements
In general, the use of indexes tends to make XPath expressions more fragile and more likely to fail. Thus, if possible, I would always avoid them. In your example you actually take the locality of the address info. The span element can also easily be referenced by its class name, which makes your expression much more robust:
//li[@class="vcard hq"]/p/span[@class='locality']/text()
Here is my testing code according to your problem description:
# -*- coding: utf-8 -*-
from scrapy.selector import Selector
html_text = """
<li class="type">
<h4>Type</h4>
<p>
Privately Held
</p>
</li>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr" itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
<span class="street-address" itemprop="streetAddress">Kornhamnstorg 49</span>
<span class="street-address" itemprop="streetAddress"></span>
<span class="locality" itemprop="addressLocality">Stockholm,</span>
<abbr class="region" title="Stockholm" itemprop="addressRegion">Stockholm</abbr>
<span class="postal-code" itemprop="postalCode">S-11127</span>
<span class="country-name" itemprop="addressCountry">Sweden</span>
</p>
</li>
<li class="company-size">
<h4>Company Size</h4>
<p>
11-50 employees
</p>
"""
sel = Selector(text=html_text)
companytype_list = sel.xpath(
    '''.//li[@class="type"]/p/text()''').extract()
headquarters_list = sel.xpath(
    '''.//li[@class="vcard hq"]/p/span[3]/text()''').extract()
companysize_list = sel.xpath(
    '''.//li[@class="company-size"]/p/text()''').extract()
It doesn't raise any exception. So chances are there exist web pages with a different structure causing errors.
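To illustrate how a slightly different page can break the index-based expression, here is a small sketch. It uses lxml directly instead of Scrapy's Selector so that it runs on its own; the XPath expressions are the same. The VARIANT markup is hypothetical: it is the address block from the question with the empty second street-address span left out. The helper first_or_none is also my own name.

```python
from lxml import html as lxml_html

# Hypothetical variant of the address block: the empty second
# street-address span is missing, so every later span moves up by one.
VARIANT = """
<ul>
<li class="vcard hq">
<h4>Headquarters</h4>
<p class="adr">
<span class="street-address">Kornhamnstorg 49</span>
<span class="locality">Stockholm,</span>
<abbr class="region" title="Stockholm">Stockholm</abbr>
<span class="postal-code">S-11127</span>
<span class="country-name">Sweden</span>
</p>
</li>
</ul>
"""

root = lxml_html.fromstring(VARIANT)

# Index-based: silently picks up the postal code instead of the locality.
by_index = root.xpath('.//li[@class="vcard hq"]/p/span[3]/text()')

# Class-based: still finds the locality.
by_class = root.xpath(
    './/li[@class="vcard hq"]/p/span[@class="locality"]/text()')

# An expression that matches nothing returns an empty list, so indexing
# it with [0] is what raises IndexError.
missing = root.xpath('.//li[@class="no-such-class"]/p/span[3]/text()')


def first_or_none(values):
    """Return the first item of a result list, or None if it is empty."""
    return values[0] if values else None


print(first_or_none(by_index))  # S-11127
print(first_or_none(by_class))  # Stockholm,
print(first_or_none(missing))   # None
```

Guarding the [0] access like this keeps the spider running on pages where a field is absent; whether None is the right fallback depends on how the items are processed afterwards.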
It's good practice not to use indexes directly in XPath rules. dron22's answer gives an awesome explanation.
