How to scrape same class name data - python

I was trying to scrape some real estate websites but the one I came across has same class name under one div and that div has also 2 more div which has same class name. I want to scrape child class data (I think).
I want to scrape below class data:
<div class="m-srp-card__summary__info">New Property</div>
Below is the whole block of code I'm trying to scrape:
<div class="m-srp-card__collapse js-collapse" aria-collapsed="collapsed" data-container="srp-card-
summary">
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
<input type="hidden" id="propertyArea42679361" value="888 sqft">
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">carpet area</div>
<div class="m-srp-card__summary__info">888 sqft</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">status</div>
<div class="m-srp-card__summary__info">Ready to Move</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">floor</div>
<div class="m-srp-card__summary__info">9 out of 13 floors</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">transaction</div>
<div class="m-srp-card__summary__info">New Property</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">furnishing</div>
<div class="m-srp-card__summary__info">Unfurnished</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">facing</div>
<div class="m-srp-card__summary__info">South -West</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">overlooking</div>
<div class="m-srp-card__summary__info">Garden/Park, Main Road</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">society</div>
<div class="m-srp-card__summary__info">
<a id="project-link-42679361" class="m-srp-card__summary__link"
href="https://www.magicbricks.com/skylights-bopal-ahmedabad-pdpid-4d4235303936323633"
target="_blank">Skylights</a>
</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">car parking</div>
<div class="m-srp-card__summary__info">1 Covered</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">bathroom</div>
<div class="m-srp-card__summary__info">3</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">balcony</div>
<div class="m-srp-card__summary__info">2</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">ownership</div>
<div class="m-srp-card__summary__info">Co-operative Society</div>
</div>
</div>
<div class="m-srp-card__collapse__control js-collapse__control" data-toggle="list-collapse"
data-target="srp-card-summary" onclick="stopPage=true;">
<div class="ico m-srp-card__ico">
<svg role="icon">
<use xlink:href="#icon-caret-down"></use>
</svg>
</div>
I tried Indexing but got nothing.
Below is my code:
req = Request('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(req, 'html.parser')
containers = soup.find_all('div', {'class': 'm-srp-card__desc flex__item'})
container = containers[0]
no_apartment = container.find('h3').find('span', {'class': 'm-srp-card__title__bhk'}).getText()
c_area = container.find('div', {'class': 'm-srp-card__summary__info'}).getText()
p_price = container.find('div', {'class': 'm-srp-card__info flex__item'})
p_type = container.find('div', {'class': 'm-srp-card__summary js-collapse__content'})[3].find('div', {'class': 'm-srp-card__summary__info'})
Thanks in advance!

import requests
from bs4 import BeautifulSoup
import csv
import re
r = requests.get('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad')
soup = BeautifulSoup(r.text, 'html.parser')
category = []
size = []
price = []
floor = []
for item in soup.findAll('span', {'class': 'm-srp-card__title__bhk'}):
category.append(item.get_text(strip=True))
for item in soup.findAll(text=re.compile('area$')):
size.append(item.find_next('div').text)
for item in soup.findAll('span', {'class': 'm-srp-card__price'}):
price.append(item.text)
for item in soup.findAll(text='floor'):
floor.append(item.find_next('div').text)
data = []
for items in zip(category, size, price, floor):
data.append(items)
with open('output.csv', 'w+', newline='', encoding='UTF-8-SIG') as file:
writer = csv.writer(file)
writer.writerow(['Category', 'Size', 'Price', 'Floor'])
writer.writerows(data)
print("Operation Completed")
View Output Online: click here

Related

Beautifulsoup get element with the same class

I'm having trouble parsing HTML elements with "class" attribute using Beautifulsoup.
The html code is like this :
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I nead to get data (XPANDER 1.5L GLX, MT, 1499, Gasoline)
I try with script detail.find(class_='item-content') just only get XPANDER 1.5L GLX
please help
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retreives all occurences

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>

Parsing all elements which have tag before

I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside all rows, where parent tag is legend BBB (in this example - bbb,bbb,bbb).
Currently I've created the code below, but it doesn't look pretty, and I don't know how to find all rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if(bs.find('legend', text='BBB')):
value = parser.find('legend').next_element.next_element.next_element.get_text().strip()
print(value)
Is there any simply way to do this? div class name is the same, just "legend" is variable.
Added a <legend>CCC</legend> so that you may see it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
after_tag = bs.find("legend", text="BBB").parent # Grabs parent div <fieldset>.
divs = after_tag.find_all("div", {"class": "row"}) # Finds all div inside parent.
for div in divs:
print(div.text)
bbb
bbb
bbb
from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
the tuple object prints out
('bbb', 'bbb', 'bbb')

Python BeautifulSoup No Output

I'm testing out BeautifulSoup through Python and I only see this '[]' sign when I print this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
# print(week)
print(week.find_all('li'))
Any help would be appreciated. Thank you!
There are no li as you can see from the weeks content:
<div class="sevenDay" id="seven-day-periods">
<!-- Legend: show only when data is loaded -->
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Feels like</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Night</div>
</div>
<div class="wxRow wx_detailed-metrics nighttime">
<div class="legendColumn">Day</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">POP</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind ()</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind gust ()</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Hrs of Sun</div>
</div>
<div class="top-divider2"> </div>
</div>
<div class="divTableBody">
</div>
You may have got it mixed up when it is displayed in html. I believe what you are trying to obtain is the values inside legend column. This can be obtained using:
for x in week.find_all('div','legendColumn'):
print(x.findAll(text=True))
Thus now your code will be
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
for x in week.find_all('div','legendColumn'):
print(x.findAll(text=True))
Where the output is
['Feels like']
['Night']
['Day']
['POP']
['Wind ()']
['Wind gust ()']
['Hrs of Sun']

Scrapy conditional crawling

my HTML code contains a number of divs with mostly similar structure ... below is the code excerpt containing 2 such divs
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
So here is what I want Scrapy to do ...
If the div with class="outer-container" contains another div with title="verified" like in the 1st div above, it should go to the URL above it (i.e www.xxxxxx.com) and fetch some other feilds on that page.
If there is no div containing title="verified", like in 2nd div above, it should fetch all the data under div class="mody". i.e. company name (Fat Dude, LLC), address, city, state etc ... and NOT follow the url (i.e. www.yyyyy.com)
So how do I apply this condition/logic in Scrapy crawler. I was thinking of using BeautifulSoup, but not sure ....
What have I tried so far ....
class MySpider(CrawlSpider):
name = 'dknfetch'
start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
allowed_domains = ['www.xxxxx.com']
def parse(self, response):
hxs = Selector(response)
soup = BeautifulSoup(response.body, 'lxml')
nf = NewsFields()
cName = soup.find_all("a", class_="mheading primary h4")
addrs = soup.find_all("span", itemprop_="Address")
loclity = soup.find_all("span", itemprop_="Locality")
region = soup.find_all("span", itemprop_="Region")
post = soup.find_all("span", itemprop_="postalCode")
nf['companyName'] = cName[0]['content']
nf['address'] = addrs[0]['content']
nf['locality'] = loclity[0]['content']
nf['state'] = region[0]['content']
nf['zipcode'] = post[0]['content']
yield nf
for url in hxs.xpath('//div[#class="inner-container"]/a/#href').extract():
yield Request(url, callback=self.parse)
Ofcourse, the above code returns and crawl all the URL's under the div class="inner-container" as there is no conditional crawling specified in this code, becuase I don't know where/how to set it.
If anyone has done something similar before, please share. Thanks
No need to use BeautifulSoup, Scrapy comes with it's own selector capabilities (also released separately as parsel). Let's use your HTML to make an example:
html = u"""
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
"""
from parsel import Selector
sel = Selector(text=html)
for div in sel.css('.outer-container'):
if div.css('div[title="verified"]'):
url = div.css('a::attr(href)').extract_first()
print 'verified, follow this URL:', url
else:
nf = {}
nf['companyName'] = div.xpath('string(.//h2)').extract_first()
nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
print 'not verified, extracted item is:', nf
The result for the previous snippet is:
verified, follow this URL: www.xxxxxx.com
not verified, extracted item is: {'zipcode': u'987655', 'state': u'MA', 'address': u'7890 Business St', 'locality': u'Tokyo', 'companyName': u'Fat Dude, LLC'}
But in Scrapy you don't even need to instantiate the Selector class, there is a shortcut available in the response object passed to callbacks. Also, you shouldn't be subclassing CrawlSpider, just the regular Spider class is enough. Putting it all together:
from scrapy import Spider, Request
from myproject.items import NewsFields
class MySpider(Spider):
name = 'dknfetch'
start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
allowed_domains = ['www.xxxxx.com']
def parse(self, response):
for div in response.selector.css('.outer-container'):
if div.css('div[title="verified"]'):
url = div.css('a::attr(href)').extract_first()
yield Request(url)
else:
nf = NewsFields()
nf['companyName'] = div.xpath('string(.//h2)').extract_first()
nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
yield nf
I would suggest you to get familar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html
Happy scraping!

Categories