Python scraping: unable to extract all <li> elements - python

import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(li)
get_data_from_web()
I'm trying to extract coronavirus stats from http://mohfw.gov.in, but I get only the first li,
while there are 3 li elements in total.
I also tried selecting those li tags by their class, but then I get None.
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class='up'> (14859<i class='fa fa-arrow-up'></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class='up'><br>(14859<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
</li>
<li class="bg-green">
<strong class="mob-hide">Discharged <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class='cup'> (78399<i class='fa fa-arrow-up'></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class='cup'><br>(78399<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
</li>
<li class="bg-red">
<strong class="mob-hide">Deaths <span class="death_per"></span></strong>
<strong class="mob-hide">78586 <span class='up'> (1114<i class='fa fa-arrow-up'></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class='up'><br>(1114<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
</ul></div>

The HTML markup on that page is broken. Try parsing it with the lxml or html5lib parser instead:
import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml') # <-- change to lxml or html5lib
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    lis = div.find_all('li')
    for li in lis:
        print(li)
        print('-' * 80)
get_data_from_web()
Prints:
<li class="bg-blue">
<strong class="mob-hide">Active  <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up">     (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-green">
<strong class="mob-hide">Discharged  <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class="cup">     (78399<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class="cup"><br/>(78399<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-red">
<strong class="mob-hide">Deaths  <span class="death_per"></span></strong>
<strong class="mob-hide">78586     <span class="up"> (1114<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class="up"><br/>(1114<i class="fa fa-arrow-up"></i>)</span></strong></span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
--------------------------------------------------------------------------------
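If it is unclear which of these parsers is installed, one option is to try them in order of preference. The sketch below is only an illustration: make_soup is a hypothetical helper (not part of Beautiful Soup), and the parser order is an assumption. It relies on BeautifulSoup raising FeatureNotFound when the requested parser is missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first usable parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one.
            continue
    raise RuntimeError("none of the requested parsers is available")


# Small well-formed sample, so every parser gives the same answer here.
soup, used = make_soup("<ul><li>a</li><li>b</li></ul>")
print(used, len(soup.find_all("li")))
```

Note that html.parser is kept only as a last resort here, since it is the parser that truncated the div in the question.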

I printed the div itself, and with html.parser the div seems to end right after the first li tag. The code is below; run it once and you will see.
import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    print("here")
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(div)
get_data_from_web()
Here is the output -
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up"> (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span> </li></ul></div>
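The same effect can be reproduced on a small scale. The markup below is made up for illustration (it is not the real page): it assumes the cause is a stray closing </span> inside the first li while another span is open further up the tree, which is what the extra </span> in the question's HTML suggests. With html.parser, the lookup inside the div should then see only the first li, while a search over the whole document still finds all three.

```python
from bs4 import BeautifulSoup

# Made-up markup: the </span> inside the first <li> has no matching <span> there,
# but an enclosing <span> is still open higher up.
html = (
    '<span class="wrapper"><div class="stats"><ul>'
    '<li class="bg-blue">Active</span></li>'
    '<li class="bg-green">Discharged</li>'
    '<li class="bg-red">Deaths</li>'
    '</ul></div></span>'
)

soup = BeautifulSoup(html, "html.parser")

# html.parser closes every open tag up to the outer <span>, so the div ends early.
inside_div = soup.find("div", class_="stats").find_all("li")

# Searching the whole document by class still reaches the remaining items.
whole_doc = soup.find_all("li", class_=["bg-blue", "bg-green", "bg-red"])

print(len(inside_div), len(whole_doc))
```

So if switching parsers is not an option, searching from soup rather than from the div is a possible workaround.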

Related

How do I web scrape the names of the production companies from IMDB website

I need to scrape the names of the production companies of some movies. I keep trying with the anchor tag a and the class in which the names are enclosed, but it does not return the production companies.
URL : https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1
Here's the HTML part of the website that I want to scrape :
<section class="ipc-page-section ipc-page-section--base">
<div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
<ul>
<li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
</div>
</section>
Here's what I have tried:
import requests
from bs4 import BeautifulSoup
movie_url="https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1"
movie_page = requests.get(movie_url)
soup = BeautifulSoup(movie_page.text, 'html.parser')
#movies_comp = soup.find_all("li", class_="ipc-inline-list__item")
movies_comp = soup.find_all("a", class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")
print(movies_comp)
I am not getting the desired output. What I expect it to return is something like:
['IDT Entertainment', 'New Arc Entertainment']
Here's what you can try:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1")
# For this demonstration, the HTML from the question replaces the downloaded page.
page = """
<section class="ipc-page-section ipc-page-section--base">
<div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
<ul>
<li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
</div>
</section>
"""
soup = BeautifulSoup(page, "lxml")
# To understand the code below, this is the structure of the data you want to extract:
# <li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies">
# <ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation"><li role="presentation" class="ipc-inline-list__item"><a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">
# <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
# <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
print([a.text for a in soup.find("li",attrs={'class':r'ipc-metadata-list__item ipc-metadata-list-item--link','data-testid':r'title-details-companies'})
    .find("ul",class_="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base")
    .find_all("a")])
Output :
['IDT Entertainment', 'New Arc Entertainment']
There are several <a> tags with that class on the page, so your search returns more of them than you want.
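A shorter way to express the same lookup is a CSS selector keyed on the data-testid attribute. This is a sketch on a trimmed-down copy of the question's HTML (the long class names are dropped for readability); it assumes the data-testid value stays as shown in the question, which the live site may change.

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the HTML from the question.
html = """
<ul>
  <li data-testid="title-details-companies">
    <a href="/title/tt0473553/companycredits">Production companies</a>
    <div>
      <ul>
        <li><a href="/company/co0136980">IDT Entertainment</a></li>
        <li><a href="/company/co0142161">New Arc Entertainment</a></li>
      </ul>
    </div>
  </li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Only links inside the nested <ul> are matched, so the "Production companies"
# label link (a direct child of the outer <li>) is skipped.
links = soup.select('li[data-testid="title-details-companies"] ul a')
names = [a.get_text(strip=True) for a in links]
print(names)
```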

Python: how to find the right value with soup

I am trying to get the price of an item from the following HTML.
This is the source:
<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
This is the code that I tried
html = """\
HTML Code here from the top.
"""
soup = Soup(html)
soup.find("span", {"a-size-base a-color-price": ""}).text
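There are a number of issues in your code: Soup is not defined in the snippet (the class is BeautifulSoup), no parser is specified, and the filter dictionary looks for an attribute named "a-size-base a-color-price" instead of a class attribute with that value. See below: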
html = """<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find("span", {"class":"a-size-base a-color-price"}).text.strip())
output
1.930,99 €
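Matching on the full class string only works while the classes appear in exactly that order. A possible alternative, sketched here on a shortened copy of the question's HTML, is to select on the single class that identifies the price:

```python
from bs4 import BeautifulSoup

# Shortened copy of the HTML from the question.
html = """
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
<span class="a-size-base a-color-base">&amp; <b>Kostenlose Lieferung</b></span>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first element matching the CSS selector, or None.
price_tag = soup.select_one("span.a-color-price")
price = price_tag.get_text(strip=True) if price_tag else None
print(price)
```

The None check matters on real pages, where the price element may be missing for some items.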

how to print a particular value from html content?

Below is the HTML content. I want only the values (name, contact number, and location) that appear in it.
<div class="list-group-item">
<div class="row">
<div class="col" style="min-width: 0;">
<h2 class="h5 mt-0 text-truncate">
<a class="text-warning" href="www.example.com">
Ram
</a>
</h2>
<p class="mob-9 text-truncate">
<small>
<i class="fa fa-fw fa-mobile-alt">
</i>
Contact:
</small>
010101010
</p>
<p class="mb-2 text-truncate">
<small>
<i class="fa fa-fw fa-map-marker-alt">
</i>
Location:
</small>
5th lane, kamathipura, Kamathipura
</p>
</a>
</p>
</div>
</div>
</div>
my code is -
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("www.example.com")
page_soup = soup(url.content, 'html.parser')
name = shop.findAll("div", {"class": "list-group-item"})
print(name.h2.text)
number = shop.findAll("p", {"class": "fa fa-fw fa-map-marker-alt"})
print(?)
location = shop.findAll("p", {"class": "fa fa-fw fa-map-marker-alt"})
print(?)
The output I need, using Python, is:
'Ram', '010101010', '5th lane, kamathipura, Kamathipura'
Using the tags and class identifiers, you can grab all the content within the regions you want. Then, with content indices, you should be able to select the exact content you need, like this:
from bs4 import BeautifulSoup
url = 'myhtml.html'
with open(url) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# strip() removes the surrounding whitespace but keeps the spaces inside each value
contnt1 = [soup.find('a').contents[0].strip()]
contnt2 = [x.contents[2].strip() for x in soup.find_all("p", "text-truncate")]
print(*(contnt1 + contnt2))
Have you tried location.get_text()?
You can go here and read more about it.
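Another way to reach the values without counting content indices is to start from each small label and take the text node that follows it. This is a sketch on a cleaned-up copy of the question's HTML; it assumes every p has exactly one small label followed directly by the value.

```python
from bs4 import BeautifulSoup

# Cleaned-up copy of the HTML from the question.
html = """
<div class="list-group-item">
  <h2 class="h5 mt-0 text-truncate">
    <a class="text-warning" href="www.example.com">
      Ram
    </a>
  </h2>
  <p class="mob-9 text-truncate">
    <small><i class="fa fa-fw fa-mobile-alt"></i> Contact:</small>
    010101010
  </p>
  <p class="mb-2 text-truncate">
    <small><i class="fa fa-fw fa-map-marker-alt"></i> Location:</small>
    5th lane, kamathipura, Kamathipura
  </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("div", class_="list-group-item")

name = item.h2.get_text(strip=True)

# The value is the text node right after each <small> label.
values = [p.small.next_sibling.strip()
          for p in item.find_all("p", class_="text-truncate")]

print(name, *values, sep=", ")
```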

Same name under different class, get URL, BeautifulSoup Python

I have this HTML, and I need to get the URLs on it:
<div class="posts-container col-md-6"
<ul class="emb-embassies-list"
<a class="entry-title" href="commonlink.com"
<ul class="emb-embassies-list"
<a class="entry-title" href="rarelink.com"
<div class="col-md-6"
<ul class="emb-embassies-list"
<a class="entry-title" href="anothercommonlink.com"
<ul class="emb-embassies-list"
<a class="entry-title" href="legendarylink.com"
When I apply:
for i in soup.findAll('div', "posts-container col-md-6"):
    for anchor in soup.findAll('a', class_="entry-title", href=True):
        print(anchor['href'])
I get:
>commonlink.com
>rarelink.com
>anothercommonlink.com
>legendarylink.com
I only want to get the "posts-container col-md-6" ones:
>commonlink.com
>rarelink.com
To get all links under the <div> with class="posts-container col-md-6", use the CSS selector .posts-container.col-md-6 a:
from bs4 import BeautifulSoup
txt = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="commonlink.com">some link1</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="rarelink.com">some link2</a>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="anothercommonlink.com">some link3</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="legendarylink.com">some link4</a>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.posts-container.col-md-6 a'):
    print(a['href'])
Prints:
commonlink.com
rarelink.com
You can also try this:
from bs4 import BeautifulSoup
html_doc = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="commonlink.com">some link1</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="rarelink.com">some link2</a>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="anothercommonlink.com">some link3</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="legendarylink.com">some link4</a>
</div>'''
soup = BeautifulSoup(html_doc, 'lxml')
anchors = soup.find('div', class_="posts-container col-md-6").find_all('a')
for a in anchors:
    print(a['href'])
Output will be:
commonlink.com
rarelink.com
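The loop from the question can also be kept almost as written. It prints every link because the inner search starts from soup (the whole document) rather than from the div found by the outer loop. A minimal sketch of that fix, on a well-formed variant of the sample HTML:

```python
from bs4 import BeautifulSoup

# Well-formed variant of the sample HTML used in the answers.
html = """
<div class="posts-container col-md-6">
  <ul class="emb-embassies-list">
    <li><a class="entry-title" href="commonlink.com">some link1</a></li>
    <li><a class="entry-title" href="rarelink.com">some link2</a></li>
  </ul>
</div>
<div class="col-md-6">
  <ul class="emb-embassies-list">
    <li><a class="entry-title" href="anothercommonlink.com">some link3</a></li>
    <li><a class="entry-title" href="legendarylink.com">some link4</a></li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

hrefs = []
for div in soup.find_all("div", class_="posts-container"):
    # Search inside the matched div, not the whole soup.
    for anchor in div.find_all("a", class_="entry-title", href=True):
        hrefs.append(anchor["href"])

print(hrefs)
```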

How to obtain text from list of elements in Selenium Python?

Consider the following HTML code.
<div id="tree" style="max-height: 300px; overflow-y: scroll;" class="treeview">
<ul class="list-group">
<li class="list-group-item node-tree" data-nodeid="0" style="color:undefined;background-color:undefined;">
<span class="icon glyphicon"></span>
<span class="icon node-icon glyphicon glyphicon-unchecked">
</span>
Apples
</li>
<li class="list-group-item node-tree" data-nodeid="1" style="color:undefined;background-color:undefined;">
<span class="icon glyphicon"></span>
<span class="icon node-icon glyphicon glyphicon-unchecked">
</span>
Bananas
</li>
<li class="list-group-item node-tree" data-nodeid="2" style="color:undefined;background-color:undefined;">
<span class="icon glyphicon"></span>
<span class="icon node-icon glyphicon glyphicon-unchecked">
</span>
Mangoes
</li>
<li class="list-group-item node-tree" data-nodeid="3" style="color:undefined;background-color:undefined;">
<span class="icon glyphicon"></span>
<span class="icon node-icon glyphicon glyphicon-unchecked">
</span>
Grapes
</li>
</ul>
</div>
I tried to obtain the text "Apples", "Bananas", "Mangoes", and "Grapes" using
i = 1
m = 2
while i < m:
    css_id = ".list-group-item:nth-child(" + str(i) + ") > .node-icon"
    try:
        fruit_text = driver.find_element(By.CSS_SELECTOR, css_id).text
        print(fruit_text)
        i += 1
        m += 1
    except:
        i += 1
but no strings were printed.
Any ideas on how to solve it?
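This should work for you:
# find_elements_by_css_selector was removed in Selenium 4.3; use find_elements with By instead
fruits_list = driver.find_elements(By.CSS_SELECTOR, '.list-group-item.node-tree')
for fruit in fruits_list:
    print(fruit.text)
Just make sure that the CSS selector .list-group-item.node-tree matches the fruit items on the web page. Your original selector targeted the .node-icon span, which is empty; the fruit name is a text node directly inside each li.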
