I'm having trouble parsing HTML elements by their "class" attribute using BeautifulSoup.
The HTML looks like this:
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content'), but it only returns XPANDER 1.5L GLX.
Please help.
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, whereas find returns only the first match.
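Since every value sits next to its label, one possible extension is to pair each item-name with its item-content. This is only a minimal sketch, assuming each info-item holds exactly one of each; the specs name and the trimmed HTML are illustrative, not part of the original answers.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Map each label to its value; skip blocks that lack either part.
specs = {}
for info in soup.find_all("div", class_="info-item"):
    name = info.find(class_="item-name")
    content = info.find(class_="item-content")
    if name is None or content is None:
        continue
    specs[name.get_text(strip=True)] = content.get_text(strip=True)

print(specs)
```

With a dict, a single value can then be looked up by its label, for example specs["Transmission"].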
I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
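Run against the markup pasted above, the same lookup does find the heading. That suggests (an assumption, the question does not confirm it) that the HTML returned by requests differs from what the browser shows, for instance because this section is filled in by JavaScript. A minimal sketch that checks the lookup offline and guards against a missing tag; product_title is a hypothetical helper name:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
</div>
</div>
"""


def product_title(soup):
    """Return the text of h1.prod-name, or None when the tag is absent."""
    tag = soup.find("h1", class_="prod-name")
    return tag.get_text(strip=True) if tag is not None else None


# Found in the pasted markup, missing in an empty document.
print(product_title(BeautifulSoup(html, "html.parser")))
print(product_title(BeautifulSoup("<html></html>", "html.parser")))
```

If the helper returns None for page.content, printing or saving that content and searching it for prod-name would show whether the heading is in the downloaded HTML at all.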
I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside all rows that belong to the fieldset whose legend is BBB (in this example: bbb, bbb, bbb).
So far I've written the code below, but it doesn't look pretty, and I don't know how to find all the rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if(bs.find('legend', text='BBB')):
    value = parser.find('legend').next_element.next_element.next_element.get_text().strip()
    print(value)
Is there a simple way to do this? The div class name is always the same; only the legend varies.
Added a <legend>CCC</legend> so that you may see it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
from bs4 import BeautifulSoup
bs = BeautifulSoup(html, "html.parser")
after_tag = bs.find("legend", text="BBB").parent  # Parent of the legend, i.e. the <fieldset>.
divs = after_tag.find_all("div", {"class": "row"})  # All div.row elements inside that fieldset.
for div in divs:
    print(div.text)
bbb
bbb
bbb
from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
    tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
The tuple prints as:
('bbb', 'bbb', 'bbb')
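Selecting the fieldset by position breaks if the order changes, so here is a hedged sketch that looks the legend up by its text and then collects the div.row siblings that follow it. It assumes the rows are siblings of the legend, as in the question; rows_for_legend is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

html = """
<div class="1"><fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
</fieldset></div>
<div class="1"><fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
</fieldset></div>
"""


def rows_for_legend(soup, legend_text):
    """Return the text of every div.row that follows the given legend."""
    legend = soup.find("legend", string=legend_text)
    if legend is None:
        return []  # No fieldset with that legend.
    siblings = legend.find_next_siblings("div", class_="row")
    return [row.get_text(strip=True) for row in siblings]


soup = BeautifulSoup(html, "html.parser")
print(rows_for_legend(soup, "BBB"))
print(rows_for_legend(soup, "ZZZ"))
```

Returning an empty list for an unknown legend avoids the AttributeError that chaining calls on a None result would raise.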
I'm testing out BeautifulSoup through Python and I only see this '[]' sign when I print this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
# print(week)
print(week.find_all('li'))
Any help would be appreciated. Thank you!
There are no li elements, as you can see from the content of week:
<div class="sevenDay" id="seven-day-periods">
<!-- Legend: show only when data is loaded -->
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Feels like</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Night</div>
</div>
<div class="wxRow wx_detailed-metrics nighttime">
<div class="legendColumn">Day</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">POP</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind ()</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind gust ()</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Hrs of Sun</div>
</div>
<div class="top-divider2"> </div>
</div>
<div class="divTableBody">
</div>
You may have mixed this up with how the page is displayed in the browser. I believe what you are trying to obtain is the values inside the legendColumn divs. They can be obtained using:
for x in week.find_all('div','legendColumn'):
    print(x.find_all(text=True))
Your code then becomes:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
for x in week.find_all('div','legendColumn'):
    print(x.find_all(text=True))
The output is:
['Feels like']
['Night']
['Day']
['POP']
['Wind ()']
['Wind gust ()']
['Hrs of Sun']
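If a flat list of strings is preferred over one-element lists, get_text(strip=True) can be used instead. The sketch below works on a short excerpt of the markup quoted above rather than the live page, and it guards against find returning None. The empty divTableBody is presumably filled in by a script after the page loads, which is an assumption based on the "show only when data is loaded" comment.

```python
from bs4 import BeautifulSoup

# Excerpt of the static markup quoted above; the live page may differ.
html = """
<div class="sevenDay" id="seven-day-periods">
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics"><div class="legendColumn">Feels like</div></div>
<div class="wxRow wx_detailed-metrics"><div class="legendColumn">POP</div></div>
</div>
<div class="divTableBody">
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
week = soup.find(id="seven-day-periods")

# Guard: find() returns None when the id is not in the document.
labels = []
list_items = []
if week is not None:
    labels = [
        col.get_text(strip=True)
        for col in week.find_all("div", class_="legendColumn")
    ]
    list_items = week.find_all("li")

print(labels)
print(list_items)  # Empty: the static HTML has no <li> inside this div.
```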
I was trying to scrape a real estate website, but the page I came across reuses the same class name for several divs under one parent div, and each of those contains two more divs whose class names are also repeated. I want to scrape the data from one of the child divs (I think).
I want to scrape the data in the div below:
<div class="m-srp-card__summary__info">New Property</div>
Below is the whole block of code I'm trying to scrape:
<div class="m-srp-card__collapse js-collapse" aria-collapsed="collapsed" data-container="srp-card-summary">
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
<input type="hidden" id="propertyArea42679361" value="888 sqft">
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">carpet area</div>
<div class="m-srp-card__summary__info">888 sqft</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">status</div>
<div class="m-srp-card__summary__info">Ready to Move</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">floor</div>
<div class="m-srp-card__summary__info">9 out of 13 floors</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">transaction</div>
<div class="m-srp-card__summary__info">New Property</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">furnishing</div>
<div class="m-srp-card__summary__info">Unfurnished</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">facing</div>
<div class="m-srp-card__summary__info">South -West</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">overlooking</div>
<div class="m-srp-card__summary__info">Garden/Park, Main Road</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">society</div>
<div class="m-srp-card__summary__info">
<a id="project-link-42679361" class="m-srp-card__summary__link"
href="https://www.magicbricks.com/skylights-bopal-ahmedabad-pdpid-4d4235303936323633"
target="_blank">Skylights</a>
</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">car parking</div>
<div class="m-srp-card__summary__info">1 Covered</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">bathroom</div>
<div class="m-srp-card__summary__info">3</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">balcony</div>
<div class="m-srp-card__summary__info">2</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">ownership</div>
<div class="m-srp-card__summary__info">Co-operative Society</div>
</div>
</div>
<div class="m-srp-card__collapse__control js-collapse__control" data-toggle="list-collapse"
data-target="srp-card-summary" onclick="stopPage=true;">
<div class="ico m-srp-card__ico">
<svg role="icon">
<use xlink:href="#icon-caret-down"></use>
</svg>
</div>
I tried indexing but got nothing.
Below is my code:
req = Request('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(req, 'html.parser')
containers = soup.find_all('div', {'class': 'm-srp-card__desc flex__item'})
container = containers[0]
no_apartment = container.find('h3').find('span', {'class': 'm-srp-card__title__bhk'}).getText()
c_area = container.find('div', {'class': 'm-srp-card__summary__info'}).getText()
p_price = container.find('div', {'class': 'm-srp-card__info flex__item'})
p_type = container.find('div', {'class': 'm-srp-card__summary js-collapse__content'})[3].find('div', {'class': 'm-srp-card__summary__info'})
Thanks in advance!
import requests
from bs4 import BeautifulSoup
import csv
import re
r = requests.get('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad')
soup = BeautifulSoup(r.text, 'html.parser')
category = []
size = []
price = []
floor = []
# Apartment type, taken from the card title.
for item in soup.findAll('span', {'class': 'm-srp-card__title__bhk'}):
    category.append(item.get_text(strip=True))
# Any label ending in "area"; the value is in the next div.
for item in soup.findAll(text=re.compile('area$')):
    size.append(item.find_next('div').text)
for item in soup.findAll('span', {'class': 'm-srp-card__price'}):
    price.append(item.text)
# The "floor" label; the value is in the next div.
for item in soup.findAll(text='floor'):
    floor.append(item.find_next('div').text)
data = []
for items in zip(category, size, price, floor):
    data.append(items)
with open('output.csv', 'w+', newline='', encoding='UTF-8-SIG') as file:
    writer = csv.writer(file)
    writer.writerow(['Category', 'Size', 'Price', 'Floor'])
    writer.writerows(data)
print("Operation Completed")
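To reach a specific field such as "transaction" without indexing, one option is to turn each summary block into a dict keyed by the title text. This is a sketch under the assumption that every m-srp-card__summary__item holds one title div and one info div, as in the block quoted in the question; summary_to_dict is a hypothetical helper, and the HTML is trimmed with the link target replaced by a placeholder.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the summary block from the question.
html = """
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">carpet area</div>
<div class="m-srp-card__summary__info">888 sqft</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">transaction</div>
<div class="m-srp-card__summary__info">New Property</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">society</div>
<div class="m-srp-card__summary__info">
<a class="m-srp-card__summary__link" href="#">Skylights</a>
</div>
</div>
</div>
"""


def summary_to_dict(card):
    """Map each summary title to its info text within one card."""
    result = {}
    for item in card.find_all("div", class_="m-srp-card__summary__item"):
        title = item.find("div", class_="m-srp-card__summary__title")
        info = item.find("div", class_="m-srp-card__summary__info")
        if title is None or info is None:
            continue  # Skip incomplete items.
        result[title.get_text(strip=True)] = info.get_text(strip=True)
    return result


soup = BeautifulSoup(html, "html.parser")
summary = summary_to_dict(soup)
print(summary.get("transaction"))
```

Using summary.get(...) returns None for cards that lack a field, so listings with fewer summary items do not raise an exception.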
There is this HTML:
<div>
<div data-id="1"> </div>
<div data-id="2"> </div>
<div data-id="3"> </div>
...
<div> </div>
</div>
I need to select only the inner divs that have the data-id attribute (regardless of its value). How do I achieve that with Scrapy?
You can use the following
response.css('div[data-id]').extract()
It will give you a list of all divs with data-id attribute.
[u'<div data-id="1"> </div>',
u'<div data-id="2"> </div>',
u'<div data-id="3"> </div>']
Use BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>""", "html.parser")
print(soup.find_all("div", {"data-id":True}))
Output:
[<div data-id="1"> </div>, <div data-id="2"> </div>, <div data-id="3"> </div>]
In find or find_all you can require an attribute to be present by giving it the value True.
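The same attribute-presence test can also be written as a CSS selector in BeautifulSoup, mirroring the div[data-id] selector from the Scrapy answer. A minimal sketch; the with_data_id and ids names are illustrative only:

```python
from bs4 import BeautifulSoup

html = """<div>
<div data-id="1"> </div>
<div data-id="2"> </div>
<div data-id="3"> </div>
<div> </div>
</div>"""

soup = BeautifulSoup(html, "html.parser")

# [data-id] matches any div that carries the attribute, whatever its value.
with_data_id = soup.select("div[data-id]")
ids = [div["data-id"] for div in with_data_id]

print(ids)
```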
<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>
Take a look at the example HTML code above.
To get all divs with a data-class attribute in Scrapy v1.6+:
response.xpath('//a[@data-pid="192"]/div[contains(@data-class,"")]').getall()
In Scrapy versions below 1.6 you can use extract() in place of getall().
Hope this helps
scrapy shell
In [1]: b = '''
...: <div>
...: <div data-id="1">gdfg </div>
...: <div data-id="2">dgdfg </div>
...: <div data-id="3">asdasd </div>
...: <div> </div>
...: </div>
...: '''
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=b, type="html")
In [4]: sel.xpath('//div[re:test(@data-id,"\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']