Same name under different class, get URL, BeautifulSoup Python - python

I have this HTML, and I need to get the URLs on it:
<div class="posts-container col-md-6"
<ul class="emb-embassies-list"
<a class="entry-title" href="commonlink.com"
<ul class="emb-embassies-list"
<a class="entry-title" href="rarelink.com"
<div class="col-md-6"
<ul class="emb-embassies-list"
<a class="entry-title" href="anothercommonlink.com"
<ul class="emb-embassies-list"
<a class="entry-title" href="legendarylink.com"
When I apply:
for i in soup.findAll('div', "posts-container col-md-6"):
    for anchor in soup.findAll('a', class_="entry-title", href=True):
        print(anchor['href'])
I get:
>commonlink.com
>rarelink.com
>anothercommonlink.com
>legendarylink.com
I only want to get the "posts-container col-md-6" ones:
>commonlink.com
>rarelink.com

To get all links under <div> with class="posts-container col-md-6" use CSS selector .posts-container.col-md-6 a:
from bs4 import BeautifulSoup
txt = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="commonlink.com">some link1</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="rarelink.com">some link2</a>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="anothercommonlink.com">some link3</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="legendarylink.com">some link4</a>
</div>'''
soup = BeautifulSoup(txt, 'html.parser')
for a in soup.select('.posts-container.col-md-6 a'):
    print(a['href'])
Prints:
commonlink.com
rarelink.com
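
For context on why the loop in the question printed all four links: the inner call searched soup (the whole document) rather than the <div> matched by the outer loop. A minimal sketch of the scoped version follows. The helper name links_in_container and the tidied-up HTML are illustrative assumptions, not part of the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed version of the HTML from the question.
HTML = """
<div class="posts-container col-md-6">
  <ul class="emb-embassies-list"><li><a class="entry-title" href="commonlink.com">link1</a></li></ul>
  <ul class="emb-embassies-list"><li><a class="entry-title" href="rarelink.com">link2</a></li></ul>
</div>
<div class="col-md-6">
  <ul class="emb-embassies-list"><li><a class="entry-title" href="anothercommonlink.com">link3</a></li></ul>
  <ul class="emb-embassies-list"><li><a class="entry-title" href="legendarylink.com">link4</a></li></ul>
</div>
"""


def links_in_container(html):
    """Return the hrefs of entry-title links inside posts-container divs."""
    soup = BeautifulSoup(html, "html.parser")
    hrefs = []
    # class_="posts-container" matches any <div> whose class list contains it.
    for div in soup.find_all("div", class_="posts-container"):
        # Search inside `div`, not `soup`, so other divs are ignored.
        for anchor in div.find_all("a", class_="entry-title", href=True):
            hrefs.append(anchor["href"])
    return hrefs


print(links_in_container(HTML))
```

The only change from the question's code is the object the inner find_all is called on.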

You can try it:
from bs4 import BeautifulSoup
html_doc = '''
<div class="posts-container col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="commonlink.com">some link1</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="rarelink.com">some link2</a>
</div>
<div class="col-md-6">
<ul class="emb-embassies-list">
<a class="entry-title" href="anothercommonlink.com">some link3</a>
<ul class="emb-embassies-list">
<a class="entry-title" href="legendarylink.com">some link4</a>
</div>'''
soup = BeautifulSoup(html_doc, 'lxml')
anchors = soup.find('div', class_="posts-container col-md-6").find_all('a')
for a in anchors:
    print(a['href'])
Output will be:
commonlink.com
rarelink.com

Related

How do I web scrape the names of the production companies from IMDB website

I need to scrape the names of the production companies of some movies. I keep trying with the anchor tag a and the class that encloses the names, but it does not return the production companies.
URL : https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1
Here's the HTML part of the website that I want to scrape :
<section class="ipc-page-section ipc-page-section--base">
<div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
<ul>
<li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
</div>
</section>
Here's what I have tried:
import requests
from bs4 import BeautifulSoup
movie_url="https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1"
movie_page = requests.get(movie_url)
soup = BeautifulSoup(movie_page.text, 'html.parser')
#movies_comp = soup.find_all("li", class_="ipc-inline-list__item")
movies_comp = soup.find_all("a", class_="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link")
print(movies_comp)
I am not getting the desired output. I expect it to return something like:
['IDT Entertainment', 'New Arc Entertainment']
Here's what you can try :
import requests
from bs4 import BeautifulSoup
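# On the live site you would fetch the page like this:
# page = requests.get("https://www.imdb.com/title/tt0473553/?ref_=fn_al_tt_1").text
# The HTML snippet from the question is used here instead: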
page="""
<section class="ipc-page-section ipc-page-section--base">
<div data-testid="title-details-section" class="styles__MetaDataContainer-sc-12uhu9s-0 cgqHBf">
<ul>
<li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies"><a class="ipc-metadata-list-item__label ipc-metadata-list-item__label--link" rel="" href="/title/tt0473553/companycredits?ref_=tt_dt_co" target="">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
</div>
</section>
"""
soup=BeautifulSoup(page,"lxml")
# To understand this, here is the structure of the data you want to extract:
# <li role="presentation" class="ipc-metadata-list__item ipc-metadata-list-item--link" data-testid="title-details-companies">
# <ul class="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base" role="presentation"><li role="presentation" class="ipc-inline-list__item"><a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">
# <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
# <a class="ipc-metadata-list-item__list-content-item ipc-metadata-list-item__list-content-item--link" rel="" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
print([a.text for a in soup.find("li",attrs={'class':r'ipc-metadata-list__item ipc-metadata-list-item--link','data-testid':r'title-details-companies'})
       .find("ul",class_="ipc-inline-list ipc-inline-list--show-dividers ipc-inline-list--inline ipc-metadata-list-item__list-content base")
       .find_all("a")])
Output :
['IDT Entertainment', 'New Arc Entertainment']
Several <a> tags on the page share that class, so searching by class alone returns more of them than you want.
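
As an alternative to chaining find calls on long class strings, the same lookup can be sketched with a CSS attribute selector on data-testid. This assumes the markup matches the snippet in the question (the live IMDb page may differ); the function name production_companies and the shortened class names are illustrative.

```python
from bs4 import BeautifulSoup

# Shortened copy of the snippet from the question (class names trimmed).
SNIPPET = """
<ul>
<li role="presentation" class="ipc-metadata-list__item" data-testid="title-details-companies">
<a class="ipc-metadata-list-item__label" href="/title/tt0473553/companycredits?ref_=tt_dt_co">Production companies</a>
<div class="ipc-metadata-list-item__content-container">
<ul class="ipc-inline-list" role="presentation">
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item" href="/company/co0136980?ref_=tt_dt_co_1">IDT Entertainment</a>
</li>
<li role="presentation" class="ipc-inline-list__item">
<a class="ipc-metadata-list-item__list-content-item" href="/company/co0142161?ref_=tt_dt_co_2">New Arc Entertainment</a>
</li>
</ul>
</div>
</li>
</ul>
"""


def production_companies(html):
    """Return company names listed under the title-details-companies item."""
    soup = BeautifulSoup(html, "html.parser")
    # Only <a> tags inside the nested <ul> are matched, so the
    # "Production companies" label link is skipped.
    links = soup.select('li[data-testid="title-details-companies"] ul a')
    return [a.get_text(strip=True) for a in links]


print(production_companies(SNIPPET))
```

Anchoring on data-testid avoids depending on the generated class names, which is a design preference rather than a requirement.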

Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup.
The html code is like this :
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content'), but it only returns XPANDER 1.5L GLX.
Please help.
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
    item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this:
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, whereas find returns only the first.
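
If the labels are wanted alongside the values, one possible extension is to walk each info-item block and pair its item-name with its item-content. This is a sketch that assumes every block contains both children; the function name spec_table is hypothetical.

```python
from bs4 import BeautifulSoup

HTML_DOC = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""


def spec_table(html):
    """Map each item-name to the item-content in the same info-item block."""
    soup = BeautifulSoup(html, "html.parser")
    table = {}
    for block in soup.select(".info-item"):
        name = block.select_one(".item-name").get_text(strip=True)
        value = block.select_one(".item-content").get_text(strip=True)
        table[name] = value
    return table


print(spec_table(HTML_DOC))
```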

how to print a particular value from html content?

Below is the HTML content. I want only the values that are present in it:
<div class="list-group-item">
<div class="row">
<div class="col" style="min-width: 0;">
<h2 class="h5 mt-0 text-truncate">
<a class="text-warning" href="www.example.com">
Ram
</a>
</h2>
<p class="mob-9 text-truncate">
<small>
<i class="fa fa-fw fa-mobile-alt">
</i>
Contact:
</small>
010101010
</p>
<p class="mb-2 text-truncate">
<small>
<i class="fa fa-fw fa-map-marker-alt">
</i>
Location:
</small>
5th lane, kamathipura, Kamathipura
</p>
</a>
</p>
</div>
</div>
</div>
My code is:
import pandas as pd
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("www.example.com")
page_soup = soup(url.content, 'html.parser')
name = shop.findAll("div", {"class": "list-group-item"})
print(name.h2.text)
number = shop.findAll("p", {"class": "fa fa-fw fa-map-marker-alt"})
print(?)
location = shop.findAll("p", {"class": "fa fa-fw fa-map-marker-alt"})
print(?)
I need this output, using Python:
'Ram', '010101010', '5th lane, kamathipura, Kamathipura'
Using the tags and class identifiers, you can grab all contents within the regions you want. Then, with content indices, you should be able to select the exact content you wish, like this:
from bs4 import BeautifulSoup
url = 'myhtml.html'
with open(url) as fp:
    soup = BeautifulSoup(fp, 'html.parser')
# .strip() trims the surrounding whitespace but keeps the spaces inside the text
contnt1 = [soup.find('a').contents[0].strip()]
contnt2 = [x.contents[2].strip() for x in soup.find_all("p", "text-truncate")]
print(*(contnt1 + contnt2))
Have you tried location.get_text() ?
You can go here and read more about it.
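
The get_text suggestion can be sketched without relying on content indices by using stripped_strings, which yields each non-empty text fragment with surrounding whitespace removed. The HTML below is a tidied copy of the question's snippet (the stray closing tags are left out), and the function name contact_details is an illustrative assumption.

```python
from bs4 import BeautifulSoup

HTML = """
<div class="list-group-item">
  <h2 class="h5 mt-0 text-truncate">
    <a class="text-warning" href="www.example.com">
      Ram
    </a>
  </h2>
  <p class="mob-9 text-truncate">
    <small><i class="fa fa-fw fa-mobile-alt"></i> Contact: </small>
    010101010
  </p>
  <p class="mb-2 text-truncate">
    <small><i class="fa fa-fw fa-map-marker-alt"></i> Location: </small>
    5th lane, kamathipura, Kamathipura
  </p>
</div>
"""


def contact_details(html):
    """Return [name, contact, location] from one list-group-item block."""
    soup = BeautifulSoup(html, "html.parser")
    values = [soup.select_one("h2 a").get_text(strip=True)]
    for p in soup.select("p.text-truncate"):
        # stripped_strings gives e.g. ["Contact:", "010101010"];
        # the last fragment is the value after the <small> label.
        values.append(list(p.stripped_strings)[-1])
    return values


print(contact_details(HTML))
```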

Python Scraping Unable to extract all <li> 's

import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(li)
get_data_from_web()
I'm trying to extract Corona stats from http://mohfw.gov.in, but I'm getting only the first li,
while there are 3 li in total.
I also tried selecting the li tags by their classes, but I get None in response.
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class='up'> (14859<i class='fa fa-arrow-up'></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class='up'><br>(14859<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
</li>
<li class="bg-green">
<strong class="mob-hide">Discharged <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class='cup'> (78399<i class='fa fa-arrow-up'></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class='cup'><br>(78399<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
</li>
<li class="bg-red">
<strong class="mob-hide">Deaths <span class="death_per"></span></strong>
<strong class="mob-hide">78586 <span class='up'> (1114<i class='fa fa-arrow-up'></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class='up'><br>(1114<i class='fa fa-arrow-up'></i>)</span></strong></span> </span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
</ul></div>
The HTML markup on that page is broken. Try parsing it with the lxml or html5lib parser instead:
import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'lxml') # <-- change to lxml or html5lib
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    lis = div.find_all('li')
    for li in lis:
        print(li)
        print('-' * 80)
get_data_from_web()
Prints:
<li class="bg-blue">
<strong class="mob-hide">Active  <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up">     (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-green">
<strong class="mob-hide">Discharged  <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class="cup">     (78399<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Discharged </span>
<span class="mob-show"><span class="discharged_per"></span> </span>
<span class="mob-show"><strong>3702595<span class="cup"><br/>(78399<i class="fa fa-arrow-up"></i>)</span></strong></span>
</li>
--------------------------------------------------------------------------------
<li class="bg-red">
<strong class="mob-hide">Deaths  <span class="death_per"></span></strong>
<strong class="mob-hide">78586     <span class="up"> (1114<i class="fa fa-arrow-up"></i>)</span></strong>
<span class="mob-show">Deaths </span>
<span class="mob-show"><span class="death_per"></span> </span>
<span class="mob-show"><strong>78586<span class="up"><br/>(1114<i class="fa fa-arrow-up"></i>)</span></strong></span>
<!--<span class='down'> <i class='fa fa-arrow-down'></i></span>-->
</li>
--------------------------------------------------------------------------------
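
Once all three li elements are available, the labels and counts could be pulled out along these lines. This is a sketch that assumes the structure printed above (two strong.mob-hide tags per li, with the number as the first text node of the second one); the function name site_stats is hypothetical, and the embedded HTML is a trimmed, well-formed copy so that html.parser handles it.

```python
from bs4 import BeautifulSoup

# Trimmed, well-formed copy of the markup shown above.
HTML = """
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up"> (14859<i class="fa fa-arrow-up"></i>)</span></strong>
</li>
<li class="bg-green">
<strong class="mob-hide">Discharged <span class="discharged_per"></span></strong>
<strong class="mob-hide">3702595<span class="cup"> (78399<i class="fa fa-arrow-up"></i>)</span></strong>
</li>
<li class="bg-red">
<strong class="mob-hide">Deaths <span class="death_per"></span></strong>
<strong class="mob-hide">78586 <span class="up"> (1114<i class="fa fa-arrow-up"></i>)</span></strong>
</li>
</ul>
</div>
"""


def site_stats(html):
    """Return {label: count} for each <li> in the site-stats-count block."""
    soup = BeautifulSoup(html, "html.parser")
    stats = {}
    for li in soup.select("div.site-stats-count li"):
        label_tag, count_tag = li.select("strong.mob-hide")[:2]
        label = label_tag.get_text(strip=True)
        # The first child of the second <strong> is the bare number text.
        stats[label] = int(count_tag.contents[0].strip())
    return stats


print(site_stats(HTML))
```

On the live page the parser choice from the answer above still applies; this sketch only covers the extraction step.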
I tried printing the div, and it seems that with html.parser the div ends right after the first li tag. Below is the code; run it once and you will see.
import requests
from bs4 import BeautifulSoup
def get_data_from_web():
    print("here")
    url = "http://mohfw.gov.in"
    page = requests.get(url)
    soup = BeautifulSoup(page.content, 'html.parser')
    div = soup.find('div', class_='col-xs-8 site-stats-count')
    li = div.find_all('li')
    print(div)
get_data_from_web()
Here is the output:
<div class="col-xs-8 site-stats-count">
<ul style="margin-top:0px;">
<li class="bg-blue">
<strong class="mob-hide">Active <span class="active_per"></span></strong>
<strong class="mob-hide">973175<span class="up"> (14859<i class="fa fa-arrow-up"></i>)</span></strong>
<!--<span class='down'>3565 <i class='fa fa-arrow-down'></i></span>-->
<span class="mob-show">Active </span>
<span class="mob-show"><span class="active_per"></span> </span>
<span class="mob-show"><strong>973175<span class="up"><br/>(14859<i class="fa fa-arrow-up"></i>)</span></strong></span> </li></ul></div>

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
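
No answer is included for this question. As a starting point for debugging, the sketch below shows that the lookup from the question does work on the static source shown above, and returns None instead of raising when the element is missing. One possible explanation, offered as an assumption rather than a diagnosis, is that the HTML returned by requests differs from what the browser inspector shows. The function name find_title is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened copy of the source code from the question.
STATIC_HTML = """
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
</div>
</div>
"""


def find_title(html):
    """Return the text of h1.prod-name, or None if it is not in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="prod-name")
    if title is None:
        return None
    return title.get_text(strip=True)


print(find_title(STATIC_HTML))
# An empty page stands in for a response that lacks the element.
print(find_title("<html><body></body></html>"))
```

Checking whether the string prod-name occurs in page.text at all is a quick way to tell which of the two cases applies.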
