Parse and get result with special formatting with BeautifulSoup

Parse and get result with special formatting with BeautifulSoup - python

I'm newbies and i started with BeautifulSoup and Python dev and i want to get a result in full text without any HTML tags or other elements that are not text.
I did this with python:
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.php")
soup = BeautifulSoup(html_content, "lxml")
# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")
# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")
# PRINT RESULT
print count_comment_final
print count_read_final
My HTML look like this :
<div class="box">
<span class="sidebar-comment__label">Comments</span>
<meta itemprop="interactionCount" content="Comments:115">
</div>
<div class="box">
<span class="sidebar-read__label js-read">Read</span>
<meta itemprop="interactionCount" content="Read:10">
</div>
and I get this:
<meta content="Comments:115" itemprop="interactionCount"/>
<meta content="Read:10" itemprop="interactionCount"/>
I would get this:
You've 115 comments
You've 10 read
Firstly, is it possible ?
Secondly, Is my code is good?
Thirdly, could you help me? ;-)

The count_comment_final and count_read_final are tags as clearly seen from the output. You need to extract the value of the attribute content of the two tags. which is done using count_comment_final['content'] which will give as Comments:115, strip off the Comments: using split(':')
#!/usr/bin/env python
import urllib2
from bs4 import BeautifulSoup
html_content = urllib2.urlopen("http://www.demo.com/index.php")
soup = BeautifulSoup(html_content, "lxml")
# COMMENTS COUNT
count_comment = soup.find("span", "sidebar-comment__label")
count_comment
count_comment_final = count_comment.find_next("meta")
# READ COUNT
count_read = soup.find("span", "sidebar-read__label js-read")
count_read
count_read_final = count_read.find_next("meta")
# PRINT RESULT
print count_comment_final['content'].split(':')[1]
print count_read_final['content'].split(':')[1]

count_comment_final and count_read_final are tag elements,
You can use,
count_comment_final.get('content')
This will give a output like this,
'Comments:115'
So you can get the comments count like,
count_comment_final.get('content').split(':')[1]
Same is applicable to count_read_final,
count_read_final.get('content').split(':')[1]

Related

Python - BeautifulSoup - How to return two different elements or more, with different attributes?

HTML Exemple
<html>
<div book="blue" return="abc">
<h4 class="link">www.example.com</h4>
<p class="author">RODRIGO</p>
</html>
Ex1:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res=page_soup.find_all(attrs={"class": ["author","link"]})
for each in res:
print(each)
Result1:
www.example.com
RODRIGO
Ex2:
url = urllib.request.urlopen(url)
page_soup = soup(url.read(), "html.parser")
res=page_soup.find_all(attrs={"book": ["blue"]})
for each in res:
print(each["return")
Result 2:
abc
!!!puzzle!!!
The question I have is how to return the 3 results in a single query?
Result 3
www.example.com
RODRIGO
abc

Example HTML seems to be broken - Assuming the div wrappes the other tags and it is may not the only book you can select all books:
for e in soup.find_all(attrs={"book": ["blue"]}):
print(' '.join(e.stripped_strings),e.get('return'))
Example
from bs4 import BeautifulSoup
html = '''
<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>
'''
soup = BeautifulSoup(html)
for e in soup.find_all(attrs={"book": ["blue"]}):
print(' '.join(e.stripped_strings),e.get('return'))
Output
www.rodrigo.com RODRIGO abc
A more structured example could be:
data = []
for e in soup.select('[book="blue"]'):
data.append({
'link':e.h4.text,
'author':e.select_one('.author').text,
'return':e.get('return')
})
data
Output:
[{'link': 'www.rodrigo.com', 'author': 'RODRIGO', 'return': 'abc'}]

For the case one attribute against many values a regex approach is suggested:
from bs4 import BeautifulSoup
import re
html = """<html>
<div book="blue" return="abc">
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
soup = BeautifulSoup(html, 'lxml')
by_clss = soup.find_all(class_=re.compile(r'link|author'))
print(b_clss)
For more flexibility, a custom query function can be passed to find or find_all:
from bs4 import BeautifulSoup
html = """<html>
<div href="blue" return="abc"></div> <!-- div need a closing tag in a html-doc-->
<h4 class="link">www.rodrigo.com</h4>
<p class="author">RODRIGO</p>
</html>"""
def query(tag):
if tag.has_attr('class'):
# tag['class'] is a list. Here assumed that has only one value
return set(tag['class']) <= {'link', 'author'}
if tag.has_attr('book'):
return tag['book'] in {'blue'}
return False
print(soup.find_all(query))
# [<div book="blue" return="abc"></div>, <h4 class="link">www.rodrigo.com</h4>, <p class="author">RODRIGO</p>]
Notice that your html-sample has no closing div-tag. In my second case I added it otherwise the soup... will not taste good.
EDIT
To retrieve elements which satisfies a simultaneous conditions on attributes the query could look like this:
def query_by_attrs(**tag_kwargs):
# tag_kwargs: {attr: [val1, val2], ...}
def wrapper(tag):
for attr, values in tag_kwargs.items():
if tag.has_attr(attr):
# check if tag has multi-valued attributes (class,...)
if not isinstance((tag_attr:=tag[attr]), list): # := for python >=3.8
tag_attr = (tag_attr,) # as tuple
return bool(set(tag_attr).intersection(values)) # false if empty set
return wrapper
q_data = {'class': ['link', 'author'], 'book': ['blue']}
results = soup.find_all(query_by_attrs(**q_data))
print(results)

Extract All link from WebSite
import requests
from bs4 import BeautifulSoup
url = 'https://mixkit.co/free-stock-music/hip-hop/'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
print(link.get('href'))

Question about using beautifulsoup to parse html

I just got started out learning to use BeautifulSoup in Python to parse html and have a very simple stupid question. Somehow, I just couldn't get Text 1 only from the html below (stored in containers).
....
<div class="listA">
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
</div>
...
soup = BeautifulSoup(driver.page_source, 'html.parser')
containers = soup.findAll("div", {"class": "listA"})
datas = []
for data in containers:
textspan = data.find("span")
datas.append(textspan.text)
The output is as follows: Text1Text2Text3
Any advice how to delimit them as well? Thanks and much appreciated!

if you just want Text 1 use this code
import bs4
content = "<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>"
soup = bs4.BeautifulSoup(content, 'html.parser')
# soup('span') will give you
# [<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>, <span>Text 1</span>]
span_text = soup('span')
for e in span_text:
if not e('span'):
print(e.text)
Output:
Text 1

Another solution involves simplifieddoc, which does not rely on third-party libraries and is lighter and faster, perfect for beginners.
Here are more examples here
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<span><span>Text 1</span><b>Text 2</b><b>Text 3</b></span>
'''
doc = SimplifiedDoc(html)
span = doc.span # Get the outermost span
first = span.span # Get the first span in span
print (first.text)
second = span.b
print (second.text)
third = second.next
print (third.text)
Result:
Text 1
Text 2
Text 3

Scraping website, but want to choose an img URL from a srcset and do it nine more times

I'm trying to scrape the BBC Sounds website for **all of the ** 'currently playing' images. I'm not bothered about which size to use, 400w might be a good.
Below is a relevant excerpt from the HTML and my current python script. A variation on this works brilliantly for the 'now playing' text, but I haven't been able to get it to work for the image URLs, which is what I'm after, I think probably because a) there's so many image URLs to choose from and b) there's a whitespace which no doubt the parser doesn't like. Please bear in mind the HTML code below is repeated about 10 times for each of the channels. I've included just one as an example. Thank you!
import requests
from bs4 import BeautifulSoup
url = "https://www.bbc.co.uk/sounds"
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
g_data = soup.find_all("div", {"class": "sc-o-responsive-image__img sc-u-circle"})
print g_data[0].text
print g_data[1].text
print g_data[2].text
print g_data[3].text
print g_data[4].text
print g_data[5].text
print g_data[6].text
print g_data[7].text
print g_data[8].text
print g_data[9].text
.
<div class="gel-layout__item sc-o-island">
<div class="sc-c-network-item__image sc-o-island" aria-hidden="true">
<div class="sc-c-rsimage sc-o-responsive-image sc-o-responsive-image--1by1 sc-u-circle">
<img alt="" class="sc-o-responsive-image__img sc-u-circle"
src="https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg" srcSet="https://ichef.bbci.co.uk/images/ic/160x160/p07fzzgr.jpg 160w,
https://ichef.bbci.co.uk/images/ic/192x192/p07fzzgr.jpg 192w,
https://ichef.bbci.co.uk/images/ic/224x224/p07fzzgr.jpg 224w,
https://ichef.bbci.co.uk/images/ic/288x288/p07fzzgr.jpg 288w,
https://ichef.bbci.co.uk/images/ic/368x368/p07fzzgr.jpg 368w,
https://ichef.bbci.co.uk/images/ic/400x400/p07fzzgr.jpg 400w,
https://ichef.bbci.co.uk/images/ic/448x448/p07fzzgr.jpg 448w,
https://ichef.bbci.co.uk/images/ic/496x496/p07fzzgr.jpg 496w,
https://ichef.bbci.co.uk/images/ic/512x512/p07fzzgr.jpg 512w,
https://ichef.bbci.co.uk/images/ic/576x576/p07fzzgr.jpg 576w,
https://ichef.bbci.co.uk/images/ic/624x624/p07fzzgr.jpg 624w"
sizes="(max-width: 400px) 34vw,(max-width: 600px) 25vw,17vw"/>

import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.bbc.co.uk/sounds")
soup = BeautifulSoup(r.text, 'html.parser')
for item in soup.findAll("img", {'class': 'sc-o-responsive-image__img sc-u-circle'}):
print(item.get("src"))
Output:
https://ichef.bbci.co.uk/images/ic/400x400/p05mpj80.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07dg040.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07zml97.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p0428n3t.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p01lyv4b.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06yphh0.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p05v4t1c.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06z9zzc.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06x0hxb.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p06n253f.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p060m6jj.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07l4fjw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03710d6.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07nn0dw.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p078qrgm.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p07sq0gr.jpg
https://ichef.bbci.co.uk/images/ic/400x400/p03crmyc.jpg

How to scrape second <p> of webpage using python and Beautifulsoup

I've been trying to work with BeautifulSoup because I want to try and scrape a webpage (https://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1). So far I scraped some elements with success but now I wanted to scrape a movie description but I've been struggling. The description is simply situated like this in html :
<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>
I want to scrape the second paragraph which seemed easy to do but everything I tried gave me 'None' as output. I've been digging around to find an answer. In an other stackoverflow post I found that
find('p:nth-of-type(1)')
or
find_elements_by_css_selector('.lister-item-mode >p:nth-child(1)')
could do the trick but it still gives me
none #as output
Below you can find a piece of my code it's a bit low grade code because I'm just trying out stuff to learn
import urllib2
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?
release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-
advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find('p', class_='muted-text')
#first_description = first_description.text
print first_description
the above code gives me this output:
$ python scrape.py
Logan
(2017)
8.1
None
I would like to learn the correct method of selecting html tags because it will be useful to know for future projects.

find_all() method looks through a tag’s descendants and retrieves
all descendants that match your filters.
You can then use the list's index to get the element you need. Index starts at 0, so 1 will give the second item.
Change the first_description to this.
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
Full code
import urllib2
from bs4 import BeautifulSoup
from requests import get
url = 'http://www.imdb.com/search/title?release_date=2017&sort=num_votes,desc&page=1'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
movie_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
first_movie = movie_containers[0]
first_title = first_movie.h3.a.text
print first_title
first_year = first_movie.h3.find('span', class_='lister-item-year text-muted unbold')
first_year = first_year.text
print first_year
first_imdb = float(first_movie.strong.text)
print first_imdb
# !!!! problem zone ---------------------------------------------
first_description = first_movie.find_all('p', {"class":"text-muted"})[1].text.strip()
#first_description = first_description.text
print first_description
Output
Logan
(2017)
8.1
In the near future, a weary Logan cares for an ailing Professor X. However, Logan's attempts to hide from the world and his legacy are upended when a young mutant arrives, pursued by dark forces.
Read the Documentation to learn the correct method of selecting html tags.
Also consider moving to python 3.

Just playing around with .next_sibling was able to get it. There's probably a more elegant way though. At least might give you a start/some direction
from bs4 import BeautifulSoup
html = '''<div class="lister-item mode-advanced">
<div class="lister-item-content>
<p class="muted-text"> paragraph I don't need</p>
<p class="muted-text"> paragraph I need</p>
</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
first_p = soup.find('div',{'class':'lister-item mode-advanced'}).text.strip()
second_p = soup.find('div',{'class':'lister-item mode-advanced'}).next_sibling.next_sibling.text.strip()
print (second_p)
Output:
print (second_p)
paragraph I need

BeautifulSoup 4.71 support :nth-child() or any CSS4 selectors
first_description = soup.select_one('.lister-item-content p:nth-child(4)')
# or
#first_description = soup.select_one('.lister-item-content p:nth-of-type(2)')
print(desc)

Urllib2 in python: why is it not returning the webpage formatting and not the actual data

can someone tell me why, when I run this code:
import urllib2
for i in range(1,2):
id_name ='AP' + str("{:05d}".format(i))
web_page = "http://aps.unmc.edu/AP/database/query_output.php?ID=" + id_name
page = urllib2.urlopen(web_page)
html = page.read()
print html
It returns:
<html>
<head>
<title>detailed information</title>
<style type="text/css">
H1 {font-family:"Time New Roman", Times; font-style:bold; font-size:18pt; color:blue}
H1{text-align:center}
P{font-family:"Time New Roman", Times; font-style:bold; font-size:14pt; line-height:20pt}
P{text-align:justify;margin-left:0px; margin-right:0px;color:blue}
/body{background-image:url('sky.gif')}
/
A:link{color:blue}
A:visited{color:#996666}
</style>
</head>
<H1>Antimicrobial Peptide APAP00001</H1>
<html>
<p style="margin-left: 400px; margin-top: 4; margin-bottom: 0; line-height:100%">
<b>
<a href = "#" onclick = "window.close(self)"><font size="3" color=blue>Close this window
</font> </a>
</b>
</p>
</p>
</body>
</html>
And not the actual data on the page (http://aps.unmc.edu/AP/database/query_output.php?ID=00001) (e.g. net charge, length)?
If I edit this code slightly somehow, is it possible to return all of the information on the page (e.g. the information about net charge, length etc), and not just information about how the page is formatted?
Thanks
Edit 1: Due to Gahan's comment below, I tried this:
import requests
from bs4 import BeautifulSoup
for i in range(8,9):
webpage = "https://dbaasp.org/peptide-card?type=39&id=" + str(i)
response = requests.get(webpage)
soup = BeautifulSoup(response.content, 'html.parser')
print soup
However, I still seem the same answer (for example, if I run the edit 1 code and direct output to a file, and then grep the peptide sequence in the output file, it is not there).

In your original snippet, you use "AP00001" as query param:
id_name ='AP' + str("{:05d}".format(i))
so your url is: "http://aps.unmc.edu/AP/database/query_output.php?ID=AP00001", instead of "http://aps.unmc.edu/AP/database/query_output.php?ID=00001"
A fixed version of your first snippet using requests:
url = "http://aps.unmc.edu/AP/database/query_output.php"
for i in range(1,2):
id_name = "{:05d}".format(i)
response = requests.get(url, params={"ID":id_name})
print response.content

use requests library:
import requests
from bs4 import BeautifulSoup
data_require = ["Net charge", ]
for i in range(1,2):
id_value ="{:05d}".format(i)
url = "http://aps.unmc.edu/AP/database/query_output.php"
payload = {"ID": id_value}
response = requests.get(url, params=payload)
soup = BeautifulSoup(response.content, 'html.parser')
table_structure = soup.find('table')
all_p_tag = table_structure.find_all('p')
data = {}
for i in range(0, len(all_p_tag), 2):
data[all_p_tag[i].text] = all_p_tag[i+1].text.encode('utf-8').strip()
print("{} {}".format(all_p_tag[i].text, all_p_tag[i+1].text.encode('utf-8').strip()))
print(data)
Note:
you don't need to convert "{:05d}".format(i) to string as it will only return string when you use format() because it's string formatting.
also I have updated code to get tag details too.
you don't need to use grep for it because BeautifulSoup is already providing such facility.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parse and get result with special formatting with BeautifulSoup - python

Related

Python - BeautifulSoup - How to return two different elements or more, with different attributes?

Question about using beautifulsoup to parse html

Scraping website, but want to choose an img URL from a srcset and do it nine more times

How to scrape second <p> of webpage using python and Beautifulsoup

Urllib2 in python: why is it not returning the webpage formatting and not the actual data

Categories

Resources