Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements by their class attribute with BeautifulSoup.
The HTML looks like this:
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content'), but it only returns XPANDER 1.5L GLX.
Please help.

Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
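If the labels are needed alongside the values, one option is to walk each info-item block and pair its item-name with its item-content. This is only a sketch on a trimmed copy of the question's markup; the variable name specs is chosen here for illustration and does not come from the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html_doc = """
<div class="info-item">
  <div class="item-name">Model</div>
  <div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
  <div class="item-name">Engine Capacity (cc)</div>
  <div class="item-content">1499 cc</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Map each item-name to the item-content found in the same info-item block.
specs = {}
for block in soup.select("div.info-item"):
    name = block.select_one(".item-name")
    content = block.select_one(".item-content")
    if name is not None and content is not None:  # skip incomplete blocks
        specs[name.get_text(strip=True)] = content.get_text(strip=True)

print(specs)
```

Pairing inside each block avoids relying on the two lists of names and contents having the same length and order.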

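You can try this:
from bs4 import BeautifulSoup
# html is assumed to hold the markup from the question as a string.
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, whereas find returns only the first match.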
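The difference between the two calls can be seen side by side in a small sketch. It uses a shortened copy of the question's markup, and the variable names first and every are illustrative only.

```python
from bs4 import BeautifulSoup

html = """
<div class="item-content">XPANDER 1.5L GLX</div>
<div class="item-content"> MT </div>
<div class="item-content">1499 cc</div>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="item-content")      # a single Tag, or None
every = soup.find_all("div", class_="item-content")  # a list-like ResultSet

first_text = first.get_text(strip=True)
all_texts = [tag.get_text(strip=True) for tag in every]

print(first_text)
print(all_texts)
```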

Related

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
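No answer to this question is captured here. As a rough diagnostic sketch: when the posted markup is parsed directly, the same find call does locate the h1. That suggests the HTML returned by requests may differ from what the browser shows, for example because the content is rendered by JavaScript, but that is an assumption and not something the source confirms. The helper name extract_title is hypothetical.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup posted in the question.
html = """
<div class="row">
  <div class="col-md-12 col-sm-12 col-xs-12">
    <div class="brand-desc">POLO RALPH LAUREN</div>
    <h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
  </div>
</div>
"""

def extract_title(markup):
    """Return the product title, or None if the h1 is not in the markup."""
    soup = BeautifulSoup(markup, "html.parser")
    title = soup.find("h1", {"class": "prod-name"})
    return title.get_text(strip=True) if title is not None else None

found = extract_title(html)               # markup that contains the h1
missing = extract_title("<html></html>")  # markup that lacks it

print(found)
print(missing)
```

Printing page.text (or saving it to a file) and searching it for prod-name would show which of the two cases applies to the real response.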

Parsing all elements which have tag before

I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside the rows whose fieldset has the legend BBB (in this example: bbb, bbb, bbb).
So far I have the code below, but it isn't pretty, and I don't know how to find all the rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if(bs.find('legend', text='BBB')):
    value = parser.find('legend').next_element.next_element.next_element.get_text().strip()
    print(value)
Is there a simple way to do this? The div class name is always the same; only the legend varies.

Added a <legend>CCC</legend> so that you may see it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
after_tag = bs.find("legend", string="BBB").parent  # The legend's parent is the <fieldset>.
divs = after_tag.find_all("div", {"class": "row"})  # Finds all row divs inside that fieldset.
for div in divs:
    print(div.text)
Prints:
bbb
bbb
bbb

from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
    tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
The tuple object prints out:
('bbb', 'bbb', 'bbb')
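Selecting the fieldset by position ([1]) only works while BBB stays second on the page. A sketch of an alternative, assuming the legend text is known in advance, looks the fieldset up by its legend instead. The function name rows_for_legend is hypothetical and the markup is shortened from the question.

```python
from bs4 import BeautifulSoup

html = """
<div class="1"><fieldset>
  <legend>AAA</legend>
  <div class="row">aaa</div>
  <div class="row">aaa</div>
</fieldset></div>
<div class="1"><fieldset>
  <legend>BBB</legend>
  <div class="row">bbb</div>
  <div class="row">bbb</div>
</fieldset></div>
"""

def rows_for_legend(markup, legend_text):
    """Return the text of every div.row in the fieldset whose legend matches."""
    soup = BeautifulSoup(markup, "html.parser")
    for fieldset in soup.find_all("fieldset"):
        legend = fieldset.find("legend")
        if legend is not None and legend.get_text(strip=True) == legend_text:
            return [row.get_text(strip=True) for row in fieldset.select("div.row")]
    return []  # no fieldset carries that legend

bbb_rows = rows_for_legend(html, "BBB")
no_rows = rows_for_legend(html, "ZZZ")

print(bbb_rows)
print(no_rows)
```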

Python BeautifulSoup No Output

I'm testing out BeautifulSoup in Python, and the only output I see is '[]' when I run this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
# print(week)
print(week.find_all('li'))
Any help would be appreciated. Thank you!

There are no li elements, as you can see from the content of week:
<div class="sevenDay" id="seven-day-periods">
<!-- Legend: show only when data is loaded -->
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Feels like</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Night</div>
</div>
<div class="wxRow wx_detailed-metrics nighttime">
<div class="legendColumn">Day</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">POP</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind ()</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind gust ()</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Hrs of Sun</div>
</div>
<div class="top-divider2"> </div>
</div>
<div class="divTableBody">
</div>
You may have been misled by how the page is displayed in the browser. I believe what you are trying to obtain is the values inside the legend columns. They can be obtained using:
for x in week.find_all('div','legendColumn'):
    print(x.findAll(text=True))
Your code then becomes:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
for x in week.find_all('div','legendColumn'):
    print(x.findAll(text=True))
The output is:
['Feels like']
['Night']
['Day']
['POP']
['Wind ()']
['Wind gust ()']
['Hrs of Sun']
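The comment in the fetched markup ("show only when data is loaded") hints that the forecast rows are filled in after the page loads, presumably by JavaScript, which requests does not execute. That is an inference, not something confirmed here. The sketch below works on a trimmed copy of the snippet above: it shows that find_all('li') is empty for this markup, and it uses get_text to collect plain strings instead of one-element lists.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the answer.
html = """
<div class="sevenDay" id="seven-day-periods">
  <div class="wx_legend wx_legend_hidden">
    <div class="wxRow wx_detailed-metrics">
      <div class="legendColumn">Feels like</div>
    </div>
    <div class="wxRow wx_detailed-metrics">
      <div class="legendColumn">POP</div>
    </div>
  </div>
  <div class="divTableBody"></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
week = soup.find(id="seven-day-periods")

list_items = week.find_all("li")  # empty: this markup contains no <li>
labels = [col.get_text(strip=True)
          for col in week.find_all("div", class_="legendColumn")]

print(len(list_items))
print(labels)
```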

How to scrape same class name data

I was trying to scrape a real estate website, but the page I came across uses the same class name for several divs inside one parent div, and each of those divs contains two more divs that also share class names. I want to scrape the data in the child divs (I think).
I want to scrape the data in the div below:
<div class="m-srp-card__summary__info">New Property</div>
Below is the whole block of code I'm trying to scrape:
<div class="m-srp-card__collapse js-collapse" aria-collapsed="collapsed" data-container="srp-card-summary">
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
<input type="hidden" id="propertyArea42679361" value="888 sqft">
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">carpet area</div>
<div class="m-srp-card__summary__info">888 sqft</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">status</div>
<div class="m-srp-card__summary__info">Ready to Move</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">floor</div>
<div class="m-srp-card__summary__info">9 out of 13 floors</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">transaction</div>
<div class="m-srp-card__summary__info">New Property</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">furnishing</div>
<div class="m-srp-card__summary__info">Unfurnished</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">facing</div>
<div class="m-srp-card__summary__info">South -West</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">overlooking</div>
<div class="m-srp-card__summary__info">Garden/Park, Main Road</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">society</div>
<div class="m-srp-card__summary__info">
<a id="project-link-42679361" class="m-srp-card__summary__link"
href="https://www.magicbricks.com/skylights-bopal-ahmedabad-pdpid-4d4235303936323633"
target="_blank">Skylights</a>
</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">car parking</div>
<div class="m-srp-card__summary__info">1 Covered</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">bathroom</div>
<div class="m-srp-card__summary__info">3</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">balcony</div>
<div class="m-srp-card__summary__info">2</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">ownership</div>
<div class="m-srp-card__summary__info">Co-operative Society</div>
</div>
</div>
<div class="m-srp-card__collapse__control js-collapse__control" data-toggle="list-collapse"
data-target="srp-card-summary" onclick="stopPage=true;">
<div class="ico m-srp-card__ico">
<svg role="icon">
<use xlink:href="#icon-caret-down"></use>
</svg>
</div>
I tried Indexing but got nothing.
Below is my code:
req = Request('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(req, 'html.parser')
containers = soup.find_all('div', {'class': 'm-srp-card__desc flex__item'})
container = containers[0]
no_apartment = container.find('h3').find('span', {'class': 'm-srp-card__title__bhk'}).getText()
c_area = container.find('div', {'class': 'm-srp-card__summary__info'}).getText()
p_price = container.find('div', {'class': 'm-srp-card__info flex__item'})
p_type = container.find('div', {'class': 'm-srp-card__summary js-collapse__content'})[3].find('div', {'class': 'm-srp-card__summary__info'})
Thanks in advance!

import requests
from bs4 import BeautifulSoup
import csv
import re
r = requests.get('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad')
soup = BeautifulSoup(r.text, 'html.parser')
category = []
size = []
price = []
floor = []
for item in soup.findAll('span', {'class': 'm-srp-card__title__bhk'}):
    category.append(item.get_text(strip=True))
for item in soup.findAll(text=re.compile('area$')):
    size.append(item.find_next('div').text)
for item in soup.findAll('span', {'class': 'm-srp-card__price'}):
    price.append(item.text)
for item in soup.findAll(text='floor'):
    floor.append(item.find_next('div').text)
data = []
for items in zip(category, size, price, floor):
    data.append(items)
with open('output.csv', 'w+', newline='', encoding='UTF-8-SIG') as file:
    writer = csv.writer(file)
    writer.writerow(['Category', 'Size', 'Price', 'Floor'])
    writer.writerows(data)
    print("Operation Completed")
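For the specific value the question asks about (the info under "transaction"), another possibility is to pair every summary title with its info inside one card and then look the value up by title. Note that find returns a single Tag, not a list, so indexing its result with [3] as in the question does not pick the fourth item. This is a sketch on a trimmed copy of the posted block; the dictionary name summary is illustrative, and the live page may differ from the posted markup.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the block posted in the question.
html = """
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">carpet area</div>
    <div class="m-srp-card__summary__info">888 sqft</div>
  </div>
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">transaction</div>
    <div class="m-srp-card__summary__info">New Property</div>
  </div>
  <div class="m-srp-card__summary__item">
    <div class="m-srp-card__summary__title">society</div>
    <div class="m-srp-card__summary__info">
      <a class="m-srp-card__summary__link">Skylights</a>
    </div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Build {title: info} from each summary item in the card.
summary = {}
for item in soup.select("div.m-srp-card__summary__item"):
    title = item.select_one(".m-srp-card__summary__title")
    info = item.select_one(".m-srp-card__summary__info")
    if title is not None and info is not None:
        summary[title.get_text(strip=True)] = info.get_text(strip=True)

print(summary["transaction"])
print(summary["society"])
```

Looking values up by title keeps working if the site reorders the items, which positional indexing would not.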

Scrapy select HTML elements that have specific attribute name

There is this HTML:
<div>
<div data-id="1"> </div>
<div data-id="2"> </div>
<div data-id="3"> </div>
...
<div> </div>
</div>
I need to select only the inner divs that have the attribute data-id (regardless of its value). How do I achieve that with Scrapy?

You can use the following
response.css('div[data-id]').extract()
It will give you a list of all divs with data-id attribute.
[u'<div data-id="1"> </div>',
u'<div data-id="2"> </div>',
u'<div data-id="3"> </div>']

Use BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>""", "html.parser")
print(soup.find_all("div", {"data-id":True}))
Output:
[<div data-id="1"> </div>, <div data-id="2"> </div>, <div data-id="3"> </div>]
In find or find_all, you can require an attribute to be present, whatever its value, by passing True as its value.
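The same presence test can also be written as a CSS attribute selector with select, mirroring the div[data-id] selector from the Scrapy answer. A minimal sketch on the question's markup, which additionally reads the attribute values:

```python
from bs4 import BeautifulSoup

html = """<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>"""
soup = BeautifulSoup(html, "html.parser")

# [data-id] matches any div that carries the attribute, whatever its value.
with_attr = soup.select("div[data-id]")
ids = [div["data-id"] for div in with_attr]

print(ids)
```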

<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>
Take a look at the example HTML above.
To get all divs that have a data-class attribute in Scrapy v1.6+:
response.xpath('//a[@data-pid="192"]/div[contains(@data-class,"")]').getall()
In Scrapy versions before 1.6, you can use extract() in place of getall().
Hope this helps
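In XPath, the direct test for "has this attribute" is the predicate [@name]. The sketch below illustrates it with the standard library's xml.etree.ElementTree rather than Scrapy, purely so that it runs without Scrapy installed. ElementTree supports only a subset of XPath and needs well-formed markup, so treat this as an illustration of the predicate, not of Scrapy's API. The last div without the attribute is added here for contrast and is not in the original example.

```python
import xml.etree.ElementTree as ET

markup = """
<li class="gb_i" aria-grabbed="false">
  <a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
    <div data-class="gb_u"></div>
    <div data-class="gb_v"></div>
    <div></div>
  </a>
</li>
"""

root = ET.fromstring(markup.strip())

# Child divs of the matching <a> that carry a data-class attribute.
matches = root.findall("./a[@data-pid='192']/div[@data-class]")
values = [div.get("data-class") for div in matches]

print(values)
```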

scrapy shell
In [1]: b = '''
...: <div>
...: <div data-id="1">gdfg </div>
...: <div data-id="2">dgdfg </div>
...: <div data-id="3">asdasd </div>
...: <div> </div>
...: </div>
...: '''
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=b, type="html")
In [4]: sel.xpath('//div[re:test(@data-id,"\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']
