Python: how to find the right value with soup - python

I am trying to get the price of an item from the following HTML.
This is the source:
<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
This is the code that I tried
html = """\
HTML Code here from the top.
"""
soup = Soup(html)
soup.find("span", {"a-size-base a-color-price": ""}).text

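There are a number of issues in your code: Soup is never imported, no parser is passed to the constructor, and find() needs the attribute name ("class") as the dictionary key, with the class string as its value. See below: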
html = """<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find("span", {"class":"a-size-base a-color-price"}).text.strip())
Output:
1.930,99 €
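If the price is needed as a number rather than as text, the extracted string can be normalized afterwards. The following is a minimal sketch, assuming the German number format shown above (dot as thousands separator, comma as decimal separator). The helper name parse_german_price is hypothetical and not part of BeautifulSoup.

```python
from decimal import Decimal


def parse_german_price(text):
    """Convert a string such as '1.930,99 €' to a Decimal (assumes German formatting)."""
    # Drop the currency symbol and surrounding whitespace (including non-breaking spaces).
    cleaned = text.replace("€", "").replace("\xa0", " ").strip()
    # Remove thousands separators, then switch the decimal comma to a dot.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    return Decimal(cleaned)


price = parse_german_price("1.930,99 €")
print(price)
```

Decimal is used here instead of float so that the cents are not subject to binary rounding.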

Related

How to print a particular value from HTML content?

Below is the HTML content. I want only the values (name, contact number, location) that appear in it:
<div class="list-group-item">
<div class="row">
<div class="col" style="min-width: 0;">
<h2 class="h5 mt-0 text-truncate">
<a class="text-warning" href="www.example.com">
Ram
</a>
</h2>
<p class="mob-9 text-truncate">
<small>
<i class="fa fa-fw fa-mobile-alt">
</i>
Contact:
</small>
010101010
</p>
<p class="mb-2 text-truncate">
<small>
<i class="fa fa-fw fa-map-marker-alt">
</i>
Location:
</small>
5th lane, kamathipura, Kamathipura
</p>
</a>
</p>
</div>
</div>
</div>
My code is:
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("www.example.com")
page_soup = soup(url.content, 'html.parser')
name = shop.findAll("div", {"class": "list-group-item"})
print(name.h2.text)
number = shop.findAll("p", {"class": "fa fa-fw fa-map-marker-alt"})
print(?)
location = shop.findAll("p", {"class": "fa fa-fw fa-map-marker-alt"})
print(?)
I need this output using Python:
'Ram', '010101010', '5th lane, kamathipura, Kamathipura'
Using the tags and class identifiers, you can grab all the contents within the regions you want. Then, with content indices, you can select the exact content you need, like this:
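from bs4 import BeautifulSoup
url = 'myhtml.html'  # local file containing the HTML above
with open(url) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# First text node of the first <a>: the name.
# strip() removes the surrounding whitespace but keeps the spaces inside the value.
contnt1 = [soup.find('a').contents[0].strip()]
# Third child of each <p class="text-truncate">: the text that follows <small>.
contnt2 = [x.contents[2].strip() for x in soup.find_all("p", "text-truncate")]
print(*(contnt1 + contnt2))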
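Have you tried get_text()? Note that findAll returns a list of tags, so call it on an individual element, for example location[0].get_text().
You can read more about it in the documentation.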
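As an alternative to indexing into .contents, the values can be picked out by structure. The sketch below is only an illustration: it uses a trimmed, well-formed copy of the HTML from the question (the stray closing tags are omitted), and it assumes that each p holds a label inside small followed by the value.

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed copy of the HTML from the question (an assumption for this sketch).
html = """
<div class="list-group-item">
  <h2 class="h5 mt-0 text-truncate"><a class="text-warning" href="www.example.com"> Ram </a></h2>
  <p class="mob-9 text-truncate"><small><i class="fa fa-fw fa-mobile-alt"></i> Contact: </small> 010101010 </p>
  <p class="mb-2 text-truncate"><small><i class="fa fa-fw fa-map-marker-alt"></i> Location: </small> 5th lane, kamathipura, Kamathipura </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="list-group-item")

# The name is the only text inside the <h2>.
name = item.h2.get_text(strip=True)

# Each <p> yields the label first and the value last; keep the last non-empty string.
details = [list(p.stripped_strings)[-1] for p in item.find_all("p", class_="text-truncate")]

print(name, *details, sep=", ")
```

Using stripped_strings avoids depending on the exact position of whitespace nodes, which is what makes index-based access fragile.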

" \n"and other redundant data is included in list when using .contents in beautifulsoup"

<div class="pv-entity__degree-info"><h3 class="pv-entity__school-name t-16 t-black t-bold">Universitatea de Medicină și Farmacie „Grigore T. Popa” din Iași</h3>
<p class="pv-entity__secondary-title pv-entity__degree-name t-14 t-black t-normal">
<span class="visually-hidden">Degree Name</span>
<span class="pv-entity__comma-item">MD</span>
</p>
<p class="pv-entity__secondary-title pv-entity__fos t-14 t-black t-normal">
<span class="visually-hidden">Field Of Study</span>
<span class="pv-entity__comma-item">Merdicine</span>
</p>
<!-- --> </div>
I have been calling .contents on this div tag. We can clearly see that it has 3 child tags, but .contents gives a list of length 8 when it should be length 3. Why?
BeautifulSoup includes both Tags and NavigableStrings in .contents, so the newlines between the tags and the trailing comment are counted as children too. If you want just the tags, you can use .find_all() with the parameter recursive=False:
from bs4 import BeautifulSoup
txt = '''<div class="pv-entity__degree-info"><h3 class="pv-entity__school-name t-16 t-black t-bold">Universitatea de Medicină și Farmacie „Grigore T. Popa” din Iași</h3>
<p class="pv-entity__secondary-title pv-entity__degree-name t-14 t-black t-normal">
<span class="visually-hidden">Degree Name</span>
<span class="pv-entity__comma-item">MD</span>
</p>
<p class="pv-entity__secondary-title pv-entity__fos t-14 t-black t-normal">
<span class="visually-hidden">Field Of Study</span>
<span class="pv-entity__comma-item">Merdicine</span>
</p>
<!-- --> </div>'''
soup = BeautifulSoup(txt, 'html.parser')
print(len(soup.div.find_all(recursive=False)))
Prints:
3
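To see where the extra entries come from, it can help to print the type of each child. This is a small sketch on a shortened, made-up div with the same shape (three tags, newlines between them, a comment, and a trailing space); the variable names are illustrative only.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Shortened stand-in for the markup in the question: 3 tags, 3 newlines, 1 comment, 1 space.
txt = '<div><h3>School</h3>\n<p>MD</p>\n<p>Medicine</p>\n<!-- --> </div>'
soup = BeautifulSoup(txt, 'html.parser')

# Show every direct child together with its node type.
for node in soup.div.contents:
    print(type(node).__name__, repr(node))

# Keep only real tags; whitespace strings and comments are filtered out.
tags_only = [node for node in soup.div.contents if isinstance(node, Tag)]

# Comment is a subclass of NavigableString, so it has to be checked separately.
comments = [node for node in soup.div.contents if isinstance(node, Comment)]
text_nodes = [
    node for node in soup.div.contents
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
print(len(tags_only), len(comments), len(text_nodes))
```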

Python scraping: unable to extract all <li> elements

import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(li)
get_data_from_web()
I'm trying to extract Corona stats from http://mohfw.gov.in, but I'm getting only the first li, while there are 3 li elements in total.
I tried specifying the class for those li tags, but I get None in response.
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class='up'> (14859<i class='fa fa-arrow-up'></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class='up'><br>(14859<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
</li>
<li class="bg-green">
<strong class="mob-hide">Discharged <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class='cup'> (78399<i class='fa fa-arrow-up'></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class='cup'><br>(78399<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
</li>
<li class="bg-red">
<strong class="mob-hide">Deaths <span class="death_per"></span></strong>
<strong class="mob-hide">78586 <span class='up'> (1114<i class='fa fa-arrow-up'></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class='up'><br>(1114<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
</ul></div>
The HTML markup on that page is broken (note the extra closing </span> at the end of each li). Try parsing it with the lxml or html5lib parser instead:
import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml') # <-- change to lxml or html5lib
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    lis = div.find_all('li')
    for li in lis:
        print(li)
        print('-' * 80)
get_data_from_web()
Prints:
<li class="bg-blue">
<strong class="mob-hide">Active  <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up">     (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-green">
<strong class="mob-hide">Discharged  <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class="cup">     (78399<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class="cup"><br/>(78399<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-red">
<strong class="mob-hide">Deaths  <span class="death_per"></span></strong>
<strong class="mob-hide">78586     <span class="up"> (1114<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class="up"><br/>(1114<i class="fa fa-arrow-up"></i>)</span></strong></span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
--------------------------------------------------------------------------------
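Because the result depends on how each parser repairs broken markup, it can be useful to compare the parsers side by side on the same input. The helper below is a hypothetical sketch (count_li_per_parser is not a BeautifulSoup function); it assumes that lxml and html5lib may or may not be installed and records None for a parser that is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_li_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser name: number of <li> found inside the stats div, or None if unavailable}."""
    counts = {}
    for parser in parsers:
        try:
            soup = BeautifulSoup(html, parser)
        except FeatureNotFound:
            # Raised by bs4 when the requested parser library is not installed.
            counts[parser] = None
            continue
        div = soup.find("div", class_="site-stats-count")
        counts[parser] = len(div.find_all("li")) if div is not None else 0
    return counts


# Well-formed sample for illustration; on real, broken markup the numbers may differ per parser.
sample = '<div class="col-xs-8 site-stats-count"><ul><li>a</li><li>b</li></ul></div>'
print(count_li_per_parser(sample))
```

Passing page.content from the request above instead of sample would show which parser keeps all three li elements together for that page.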
I printed the div itself, and it seems the div ends after the first li tag. Below is the code. Try running it once and you will see.
import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    print("here")
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(div)
get_data_from_web()
Here is the output:
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up"> (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span> </li></ul></div>

Same name under different class, get URL, BeautifulSoup Python

I have this HTML, and I need to get the URLs on it:
<div class="posts-container col-md-6"
<ul class="emb-embassies-list"
<a class="entry-title" href="commonlink.com"
<ul class="emb-embassies-list"
<a class="entry-title" href="rarelink.com"
<div class="col-md-6"
<ul class="emb-embassies-list"
<a class="entry-title" href="anothercommonlink.com"
<ul class="emb-embassies-list"
<a class="entry-title" href="legendarylink.com"
When I apply:
for i in soup.findAll('div', "posts-container col-md-6"):
    for anchor in soup.findAll('a', class_="entry-title", href=True):
        print(anchor['href'])
I get:
>commonlink.com
>rarelink.com
>anothercommonlink.com
>legendarylink.com
I only want to get the "posts-container col-md-6" ones:
>commonlink.com
>rarelink.com
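Your inner loop calls soup.findAll, which searches the whole document again instead of the current div, so every link is printed. To get all links under the <div> with class="posts-container col-md-6", use the CSS selector .posts-container.col-md-6 a: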
from bs4 import BeautifulSoup
txt = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="commonlink.com">some link1</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="rarelink.com">some link2</a>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="anothercommonlink.com">some link3</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="legendarylink.com">some link4</a>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.posts-container.col-md-6 a'):
    print(a['href'])
Prints:
commonlink.com
rarelink.com
You can also try this:
from bs4 import BeautifulSoup
html_doc = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="commonlink.com">some link1</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="rarelink.com">some link2</a>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="anothercommonlink.com">some link3</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="legendarylink.com">some link4</a>
</div>'''
soup = BeautifulSoup(html_doc, 'lxml')
anchors = soup.find('div', class_="posts-container col-md-6").find_all('a')
for a in anchors:
    print(a['href'])
Output will be:
commonlink.com
rarelink.com
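The nested-loop approach from the question also works once the inner search is scoped to the current container rather than to the whole soup. A minimal sketch, using a well-formed variant of the sample HTML (the closing ul tags are added here as an assumption):

```python
from bs4 import BeautifulSoup

html_doc = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list"><a class="entry-title" href="commonlink.com">some link1</a></ul>
<ul class="emb-embassies-list"><a class="entry-title" href="rarelink.com">some link2</a></ul>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list"><a class="entry-title" href="anothercommonlink.com">some link3</a></ul>
<ul class="emb-embassies-list"><a class="entry-title" href="legendarylink.com">some link4</a></ul>
</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')

hrefs = []
for container in soup.find_all('div', class_='posts-container'):
    # Search inside the current container, not the whole document.
    for anchor in container.find_all('a', class_='entry-title', href=True):
        hrefs.append(anchor['href'])

print(hrefs)
```

Matching on the single class posts-container is enough here, because the second div carries only col-md-6.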

Extracting with .find() the second of 2 identical 'div' from html page with BS4

I'm trying to extract the second of 2 identical 'div' elements from a soup element. When parsing through and extracting with the .find() method, it returns only the first one from the top. How can I tell the script to skip the first and get the next one if some conditions are met? Below is the HTML I want to extract from.
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
This is the code I'm trying:
if '$' not in str(product.find('div', {'class': 'a-row a-size-base a-color-secondary'})):
    print('NOT IN')
    pass
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
else:
    price = product.find('div', {'class': 'a-row a-size-base a-color-secondary'})
    print(price)
However as results it still gives me this:
NOT IN
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
Rather than this:
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
Any suggestions?
You need find_all and then index into the returned list, because find only ever returns the first match. You can do the same thing with select. With bs4 4.7.1 you can use :contains to target an element by a substring of its inner text (e.g. CONtv trial), then use select_one if you want the first match or select if you want all matches. (Newer Soup Sieve releases rename this pseudo-class to :-soup-contains and deprecate :contains.) Test for None before attempting to access .text.
from bs4 import BeautifulSoup as bs
import requests
html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>
'''
soup = bs(html, 'lxml')
print(soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})[1].text)
print(soup.select('.a-color-secondary')[1].text)
print(soup.select_one('.a-color-secondary:contains("CONtv trial")').text)
Looping with find_all:
matches = soup.find_all('div', {'class': 'a-row a-size-base a-color-secondary'})
for item in matches:
    if '$' in str(item):
        print(item.text)
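The loop and the None check can be combined into one small helper. This is a sketch only: the function name find_price_text is hypothetical, and the HTML is reduced to the two relevant div elements.

```python
from bs4 import BeautifulSoup


def find_price_text(soup):
    """Return the text of the first matching div that contains '$', or None if there is none."""
    for div in soup.find_all('div', class_='a-color-secondary'):
        text = div.get_text(strip=True)
        if '$' in text:
            return text
    return None


html = '''
<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

price = find_price_text(soup)
# Check for None before using the result, since a product may have no price row.
print(price if price is not None else 'no price found')
```

Returning None instead of raising keeps the calling code simple when the function is applied to many products in a loop.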
Assuming the divs are directly under the <body>, you can use standard Python indexing. In your real code, replace body in the selector with the appropriate element:
data = '''<div class="a-row a-size-base a-color-secondary"><span>MPAA Rating: PG (Parental Guidance Suggested)</span></div>
</div>
</div></div>
<div class="sg-1"><div class="sg-2">
<div class="a-section a-spacing-none a-spacing-top-small">
<div class="a-row a-size-base a-color-base">
</div>
</div>
<div class="a-section a-spacing-none a-spacing-top-mini">
<div class="a-row a-size-base a-color-secondary"><span>$0.00 with a CONtv trial on Prime Video Channels</span></div>
</div>'''
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(data, 'lxml')
print(soup.select('body > div')[1].text.strip())
Prints:
$0.00 with a CONtv trial on Prime Video Channels
Note the > sign in select(). It means we want only the <div> elements directly under the <body>.
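The difference between the child combinator (>) and a plain descendant selector can be seen on a tiny, made-up document. The section element and its id root are assumptions for this sketch, used in place of body so that the example also works with html.parser, which does not add a body element.

```python
from bs4 import BeautifulSoup

html = '''
<section id="root">
  <div>outer one<div>nested</div></div>
  <div>outer two</div>
</section>'''
soup = BeautifulSoup(html, 'html.parser')

# "#root > div" matches only the divs that are direct children of the section.
direct_children = soup.select('#root > div')

# "#root div" matches every div anywhere below the section, including the nested one.
all_descendants = soup.select('#root div')

print(len(direct_children), len(all_descendants))
print(direct_children[1].get_text(strip=True))
```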
