(Picture is small, here is another link: http://i.imgur.com/OJC0A.png)
I'm trying to extract the text of the review at the bottom. I've tried this:
y = soup.find_all("div", style = "margin-left:0.5em;")
review = y[0].text
The problem is that there is unwanted text in the unexpanded div tags that becomes tedious to remove from the content of the review. For the life of me, I just can't figure this out. Could someone please help me?
Edit: The HTML is:
<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;"> 9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
The div tag above the text is as follows:
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#strings-and-stripped-strings suggests that the .strings method is what you want - it returns an iterator over each string within the object. So if you turn that iterator into a list and take the last item, you should get what you want. For example:
$ python
>>> import bs4
>>> text = '<div style="mine"><div>unwanted</div>wanted</div>'
>>> soup = bs4.BeautifulSoup(text)
>>> soup.find_all("div", style="mine")[0].text
u'unwantedwanted'
>>> list(soup.find_all("div", style="mine")[0].strings)[-1]
u'wanted'
To get the text in the tail of div.tiny:
review = soup.find("div", "tiny").findNextSibling(text=True)
Full example:
#!/usr/bin/env python
from bs4 import BeautifulSoup
html = """<div style="margin-left:0.5em;">
<div style="margin-bottom:0.5em;">
9 of 35 people found the following review helpful </div>
<div style="margin-bottom:0.5em;">
<div style="margin-bottom:0.5em;">
<div class="tiny" style="margin-bottom:0.5em;">
<b>
<span class="h3color tiny">This review is from: </span>
<a href="http://rads.stackoverflow.com/amzn/click/B005C7QVUE">
A Dance with Dragons: A Song of Ice and Fire: Book 5 (Audible Audio Edition)</a>
</b>
</div>
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few."""
soup = BeautifulSoup(html)
review = soup.find("div", "tiny").findNextSibling(text=True)
print(review)
Output
That is true. I tried it myself this morning. There's a little note on the Audible site that says "a few titles will require two credits" or something like that. A Dance with Dragons is one of those few.
Here's equivalent lxml code that produces the same output:
import lxml.html
doc = lxml.html.fromstring(html)
print(doc.find(".//div[@class='tiny']").tail)
Related
I am working on a Scrapy project and we are scraping a news website.
There is a div that contains the site's tags, and it may have several links.
For example:
<div class="article__tags">
<a href="/example/ops.html">
OPS
</a>
<a href="/example/covid-19.html">
Covid-19
</a>
<a href="/example/usa.html">
USA
</a>
</div>
and I am trying to get the individual tags.
I am doing it like this:
tags = html.xpath(
    '//div[@class="article__tags"]/a/text()').re(r'(\w+)')
And in the above example I get the following tags:
OPS
USA
COVID
19
which is incorrect, since Covid and 19 are parts of the same tag.
How can I get the link texts correctly?
Thank you
I managed to do it by changing it to
tags = html.xpath(
    '//div[@class="article__tags"]/a/text()').extract()
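For illustration, the same idea can be sketched outside Scrapy with plain lxml (the div is copied from the example above): taking each anchor's whole text keeps a multi-word tag like Covid-19 in one piece, instead of re-splitting the extracted strings on \w+.

```python
import lxml.html

html = """<div class="article__tags">
<a href="/example/ops.html"> OPS </a>
<a href="/example/covid-19.html"> Covid-19 </a>
<a href="/example/usa.html"> USA </a>
</div>"""

doc = lxml.html.fromstring(html)
# Take each anchor's full text instead of splitting it on \w+,
# so hyphenated tags such as "Covid-19" are not broken apart.
tags = [a.text_content().strip()
        for a in doc.xpath('//div[@class="article__tags"]/a')]
print(tags)  # ['OPS', 'Covid-19', 'USA']
```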
I'm trying to fetch the book title and the book's embedded URL link from a URL; the HTML source content of the URL looks like below. I have just taken a small portion of it for illustration.
The page link is in the code further down. The relevant portion of the source HTML is as follows:
<section>
<div class="book row" isbn-data="1601982941">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="Learning Deep Architectures for AI" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/Learning-Deep-Architectures-for-AI_2015_12_30_.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>Learning Deep Architectures for AI</h2>
<span class="meta-auth"><b>Yoshua Bengio, 2009</b></span>
<div class="meta-auth-ttl"></div>
<p>Foundations and Trends(r) in Machine Learning.</p>
<div>
<a class="btn" href="http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1WePh0N" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
<section>
<div class="book row" isbn-data="1496034023">
<div class="col-lg-3">
<div class="book-cats">Artificial Intelligence</div>
<div style="width:100%;">
<img alt="The LION Way: Machine Learning plus Intelligent Optimization" class="book-cover" height="261" src="https://storage.googleapis.com/lds-media/images/The-LION-Way-Learning-plus-Intelligent-Optimiz.width-200.png" width="200"/>
</div>
</div>
<div class="col-lg-6">
<div class="star-ratings"></div>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<span class="meta-auth"><b>Roberto Battiti & Mauro Brunato, 2013</b></span>
<div class="meta-auth-ttl"></div>
<p>Learning and Intelligent Optimization (LION) is the combination of learning from data and optimization applied to solve complex and dynamic problems. Learn about increasing the automation level and connecting data directly to decisions and actions.</p>
<div>
<a class="btn" href="http://www.e-booksdirectory.com/details.php?ebook=9575" rel="nofollow">View Free Book</a>
<a class="btn" href="http://amzn.to/1FcalRp" rel="nofollow">See Reviews</a>
</div>
</div>
</div>
</section>
I have tried the code below.
It fetches the book names, but the <h2> tags still get printed around them. I would like to print each book name together with the book's PDF link.
#!/usr/bin/python3
from bs4 import BeautifulSoup as bs
import urllib
import urllib.request as ureq
web_res = urllib.request.urlopen("https://www.learndatasci.com/free-data-science-books/").read()
soup = bs(web_res, 'html.parser')
headers = soup.find_all(['h2'])
print(*headers, sep='\n')
#divs = soup.find_all('div')
#print(*divs, sep="\n\n")
header_1 = soup.find_all('h2', class_='book-container')
print(header_1)
output:
<h2>Artificial Intelligence A Modern Approach, 1st Edition</h2>
<h2>Learning Deep Architectures for AI</h2>
<h2>The LION Way: Machine Learning plus Intelligent Optimization</h2>
<h2>Big Data Now: 2012 Edition</h2>
<h2>Disruptive Possibilities: How Big Data Changes Everything</h2>
<h2>Real-Time Big Data Analytics: Emerging Architecture</h2>
<h2>Computer Vision</h2>
<h2>Natural Language Processing with Python</h2>
<h2>Programming Computer Vision with Python</h2>
<h2>The Elements of Data Analytic Style</h2>
<h2>A Course in Machine Learning</h2>
<h2>A First Encounter with Machine Learning</h2>
<h2>Algorithms for Reinforcement Learning</h2>
<h2>A Programmer's Guide to Data Mining</h2>
<h2>Bayesian Reasoning and Machine Learning</h2>
<h2>Data Mining Algorithms In R</h2>
<h2>Data Mining and Analysis: Fundamental Concepts and Algorithms</h2>
<h2>Data Mining: Practical Machine Learning Tools and Techniques</h2>
<h2>Data Mining with Rattle and R</h2>
<h2>Deep Learning</h2>
Desired Output:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Please help me understand how to achieve this. I have googled around, but due to lack of knowledge I'm unable to get it. When I look at the HTML source there are a lot of divs and classes, so I'm a little confused about which class to use to fetch the href and the h2.
The HTML is very nicely structured and you can make use of that here. The site evidently uses Bootstrap as a style scaffolding (with row and col-[size]-[gridcount] classes) that you can mostly ignore.
You essentially have:
a <div class="book"> per book
a column with
<div class="book-cats"> category and
image
a second column with
<div class="star-ratings"> ratings block
<h2> book title
<span class="meta-auth"> author line
<p> book description
two links with <a class="btn" ...>
Most of those can be ignored. Both the title and your desired link are the first element of their type, so you could just use element.nested_element to grab either.
So all you have to do is
loop over all the book divs.
for every such div, take the h2 and first a elements.
For the title take the contained text of the h2
For the link take the href attribute of the a anchor link.
like this:
for book in soup.select("div.book:has(h2):has(a.btn[href])"):
title = book.h2.get_text(strip=True)
link = book.select_one("a.btn[href]")["href"]
# store or process title and link
print("Title:", title)
print("Link:", link)
I used .select_one() with a CSS selector to be a bit more specific about which link element to accept; .btn specifies the class and [href] requires that an href attribute be present.
I also narrowed the book search by limiting it to divs that have both a title and at least one link; the :has(...) selector limits matches to those with the given child elements.
The above produces:
Title: Artificial Intelligence A Modern Approach, 1st Edition
Link: http://www.cin.ufpe.br/~tfl2/artificial-intelligence-modern-approach.9780131038059.25368.pdf
Title: Learning Deep Architectures for AI
Link: http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
Title: The LION Way: Machine Learning plus Intelligent Optimization
Link: http://www.e-booksdirectory.com/details.php?ebook=9575
... etc ...
You can get the main idea from this code. Note that each book div in the sample contains two a.btn links (the free-book link and the review link), so zipping the flat list of h2s against the flat list of all links would mispair them; taking every second link keeps the pairs aligned:
free_links = soup.find_all('a', class_='btn')[::2]  # first button of each book
for h2, link in zip(soup.find_all('h2'), free_links):
    print('Title:', h2.text)
    print('Link:', link.get('href'))
I want to extract the price of a player in futbin. Some part of the html is here:
<div class="pr pr_pc" id="pr_pc">PR: 10,250 - 150,000</div>
<div id="pclowest" class="hide">23500</div>
I've programmed this with Python:
from lxml import html
import requests
page = requests.get('https://www.futbin.com/18/player/15660/Mar%C3%A7al/')
tree = html.fromstring(page.content)
player = tree.xpath('//*[#id="pclowest"]')
print 'player: ', player
I want to extract the value 23500 automatically, but I cannot. Can someone help me?
Edit:
There's another piece of markup from which the data could perhaps be extracted:
<div class="bin_price lbin">
<span class="price_big_right">
<span id="pc-lowest-1" data-price="23,000">23,000 <img alt="c" class="coins_icon_l_bin" src="https://cdn.futbin.com/design/img/coins_bin.png">
</span>
</span>
</div>
Would it be possible to extract data-price here?
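Yes - since the value sits in an attribute, an attribute XPath can pull it out directly. This is a sketch against the snippet above, not the live page; note that sites like futbin may fill such prices in with JavaScript, in which case requests would never see this markup in page.content.

```python
from lxml import html

snippet = """<div class="bin_price lbin">
<span class="price_big_right">
<span id="pc-lowest-1" data-price="23,000">23,000</span>
</span>
</div>"""

tree = html.fromstring(snippet)
# An XPath ending in /@attr returns a list of attribute value strings
price = tree.xpath('//span[@id="pc-lowest-1"]/@data-price')[0]
print(price.replace(',', ''))  # 23000
```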
Could someone please explain how filtering works with Beautiful Soup? I've got the HTML below that I am trying to filter specific data from, but I can't seem to access it. I've tried various approaches, from gathering all class=g's to grabbing just the items of interest in that specific div, but I just get None returns or no prints.
Each page has a <div class="srg"> div with multiple <div class="g"> divs; the data I am looking to use is the data within <div class="g">. Each of these has
multiple divs, but I'm only interested in the <cite> and <span class="st"> data. I am struggling to understand how the filtering works; any help would be appreciated.
I have attempted stepping through the divs and grabbing the relevant fields:
soup = BeautifulSoup(response.text)
main = soup.find('div', {'class': 'srg'})
result = main.find('div', {'class': 'g'})
data = result.find('div', {'class': 's'})
data2 = data.find('div')
for item in data2:
site = item.find('cite')
comment = item.find('span', {'class': 'st'})
print site
print comment
I have also attempted stepping into the initial div and finding all;
soup = BeautifulSoup(response.text)
s = soup.findAll('div', {'class': 's'})
for result in s:
site = result.find('cite')
comment = result.find('span', {'class': 'st'})
print site
print comment
Test Data
<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>
UPDATE
After Alecxe's solution I took another stab at getting it right, but was still not getting anything printed. So I decided to take another look at the soup, and it looks different. I was previously looking at the response.text from requests. I can only think that BeautifulSoup modifies the response.text, or I somehow just got the sample completely wrong the first time (not sure how). However, below is the new sample based on what I am seeing from a soup print, and below that my attempt to get to the element data I am after.
<li class="g">
<h3 class="r">
context
</h3>
<div class="s">
<div class="kv" style="margin-bottom:2px">
<cite>www.url.com/index.html</cite> #Data I am looking to grab
<div class="_nBb">
<div style="display:inline"snipped">
<span class="_O0"></span>
</div>
<div style="display:none" class="am-dropdown-menu" role="menu" tabindex="-1">
<ul>
<li class="_Ykb">
<a class="_Zkb" href="/url?/search">Cached</a>
</li>
</ul>
</div>
</div>
</div>
<span class="st">Details about URI </span> #Data I am looking to grab
Update Attempt
I have tried taking Alecxe's approach to no success so far, am I going down the right road with this?
soup = BeautifulSoup(response.text)
for cite in soup.select("li.g div.s div.kv cite"):
span = cite.find_next_sibling("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
First get the div with class name srg, then find all divs with class name g inside it, and from each get the text of the site (cite) and the comment. Below is the working code for me:
from bs4 import BeautifulSoup
html = """<div class="srg">
<div class="g">
<div class="g">
<div class="g">
<div class="g">
<!--m-->
<div class="rc" data="30">
<div class="s">
<div>
<div class="f kv _SWb" style="white-space:nowrap">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">http://www.url.com. Some info on url etc etc
</span>
</div>
</div>
</div>
<!--n-->
</div>
<div class="g">
<div class="g">
<div class="g">
</div>"""
soup = BeautifulSoup(html , 'html.parser')
labels = soup.find('div',{"class":"srg"})
spans = labels.findAll('div', {"class": 'g'})
sites = []
comments = []
for data in spans:
site = data.find('cite',{'class':'_Rm'})
comment = data.find('span',{'class':'st'})
if site:#Check if site in not None
if site.text.strip() not in sites:
sites.append(site.text.strip())
else:
pass
if comment:#Check if comment in not None
if comment.text.strip() not in comments:
comments.append(comment.text.strip())
else: pass
print sites
print comments
Output-
[u'http://www.url.com.stuff/here']
[u'http://www.url.com. Some info on url etc etc']
EDIT:
Why your code does not work
For try one:
You are using result = main.find('div', {'class': 'g'}); find grabs only the first matching element, and that first element has no div with class name s, so the rest of that code finds nothing.
For try two:
You are printing site and comment outside the print scope. Try printing inside the for loop, and use .text to grab the text:
soup = BeautifulSoup(html,'html.parser')
s = soup.findAll('div', {'class': 's'})
for result in s:
site = result.find('cite')
comment = result.find('span', {'class': 'st'})
print site.text#Grab text
print comment.text
You don't have to deal with the hierarchy manually - let BeautifulSoup worry about it. Your second approach is close to what you should really be trying to do, but it would fail once you hit a div with class="s" that has no cite element inside.
Instead, you need to let BeautifulSoup know that you are interested in specific elements containing specific elements. Let's ask for cite elements located inside div elements with class="g" located inside the div element with class="srg" - div.srg div.g cite CSS selector would find us exactly what we are asking about:
for cite in soup.select("div.srg div.g cite"):
span = cite.find_next_sibling("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
Then, once the cite is located, we are "going sideways" and grabbing the next span sibling element with class="st". Though, yes, here we are assuming it exists.
For the provided sample data, it prints:
http://www.url.com.stuff/here
http://www.url.com. Some info on url etc etc
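Since we assume the sibling span exists, a slightly defensive variant (a sketch, with a trimmed-down copy of the sample data) simply skips any cite that has no following span:

```python
from bs4 import BeautifulSoup

html = """<div class="srg">
<div class="g"><div class="s"><div><div class="f kv">
<cite class="_Rm">http://www.url.com.stuff/here</cite>
<span class="st">Some info on url</span>
</div></div></div></div>
<div class="g"><div class="s"><div>
<cite class="_Rm">http://no-description.example</cite>
</div></div></div>
</div>"""

soup = BeautifulSoup(html, "html.parser")
results = []
for cite in soup.select("div.srg div.g cite"):
    span = cite.find_next_sibling("span", class_="st")
    if span is None:  # no description next to this cite - skip it
        continue
    results.append((cite.get_text(strip=True), span.get_text(strip=True)))
print(results)
```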
The updated code for the updated input data:
for cite in soup.select("li.g div.s div.kv cite"):
span = cite.find_next("span", class_="st")
print(cite.get_text(strip=True))
print(span.get_text(strip=True))
Also, make sure you are using the 4th BeautifulSoup version:
pip install --upgrade beautifulsoup4
And the import statement should be:
from bs4 import BeautifulSoup
So I need to scrape a site using Python, but the problem is that the markup is random, unstructured, and proving hard to work with.
For example
<p style='font-size: 24px;'>
<strong>Title A</strong>
</p>
<p>
<strong> First Subtitle of Title A </strong>
"Text for first subtitle"
</p>
Then it will switch to
<p>
<strong style='font-size: 24px;'> Second Subtitle for Title B </strong>
</p>
Then sometimes the new subtitles are added to the end of the previous subtitle's text
<p>
...title E's content finishes
<strong>
<span id="inserted31" style="font-size: 24px;"> Title F </span>
</strong>
</p>
<p>
<strong> First Subtitle for Title F </strong>
</p>
Enough confusion, it's simply poor markup. Obvious patterns such as 'font-size:24px;' can find the titles but there isn't a solid, reusable method to scrape the children and associate them with the title.
Regex might work but I feel like the randomness would result in scraping patterns that are too specific and not DRY.
I could offer to rewrite the HTML and fix the hierarchy; however, as this is a WordPress site, I fear the content might come back as incompatible to the admin in the WordPress interface.
Any suggestions for either a better scraping method or a way to go about WordPress would be greatly appreciated. I want to avoid just copying/pasting as much as possible.
At the least, you can rely on the tag names and text, navigating the DOM tree horizontally - going sideways. What you are showing are all strong, p, and span (with the id attribute set) tags.
For example, you can get the strong text and get the following sibling:
>>> from bs4 import BeautifulSoup
>>> data = """
... <p style='font-size: 24px;'>
... <strong>Title A</strong>
... </p>
... <p>
... <strong> First Subtitle of Title A </strong>
... "Text for first subtitle"
... </p>
... """
>>> soup = BeautifulSoup(data)
>>> titles = soup.find_all('strong')
>>> titles[0].text
u'Title A'
>>> titles[1].get_text(strip=True)
u'First Subtitle of Title A'
>>> titles[1].next_sibling.strip()
u'"Text for first subtitle"'
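Building on that, here is one possible way to group subtitles under their titles: a sketch that keys off the font-size: 24px marker described in the question. The sample data is invented, and the check covers the three places the question shows the marker - on the parent p, on the strong itself, or on a span nested inside it.

```python
from bs4 import BeautifulSoup

data = """
<p style='font-size: 24px;'><strong>Title A</strong></p>
<p><strong> First Subtitle of Title A </strong> "Text for first subtitle"</p>
<p><strong style='font-size: 24px;'> Title B </strong></p>
<p><strong> First Subtitle of Title B </strong> more text</p>
"""

soup = BeautifulSoup(data, "html.parser")

def is_title(strong):
    # the 24px marker may sit on the parent <p>, on the <strong> itself,
    # or on a <span> nested inside it (the three variants in the question)
    candidates = [strong, strong.parent] + strong.find_all("span")
    return any("24px" in (tag.get("style") or "") for tag in candidates)

sections = {}
current = None
for strong in soup.find_all("strong"):
    text = strong.get_text(strip=True)
    if is_title(strong):
        current = text
        sections[current] = []
    elif current is not None:
        # pair the subtitle with any trailing text in the same <p>;
        # assumes the tail is plain text, as in the question's markup
        tail = (strong.next_sibling or "").strip()
        sections[current].append((text, tail))
print(sections)
```

This would not survive every variant of bad markup, but it shows the idea: detect titles by the style marker, then sweep the remaining strong tags into the most recent title's bucket.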