How to print a particular value from HTML content? - Python

Below is the HTML content; I want to extract only the values inside it:
<div class="list-group-item">
<div class="row">
<div class="col" style="min-width: 0;">
<h2 class="h5 mt-0 text-truncate">
<a class="text-warning" href="www.example.com">
Ram
</a>
</h2>
<p class="mob-9 text-truncate">
<small>
<i class="fa fa-fw fa-mobile-alt">
</i>
Contact:
</small>
010101010
</p>
<p class="mb-2 text-truncate">
<small>
<i class="fa fa-fw fa-map-marker-alt">
</i>
Location:
</small>
5th lane, kamathipura, Kamathipura
</p>
</div>
</div>
</div>
My code is:
import requests
from bs4 import BeautifulSoup as soup
url = requests.get("http://www.example.com")
page_soup = soup(url.content, 'html.parser')
shop = page_soup.find("div", {"class": "list-group-item"})
print(shop.h2.text)
number = shop.find("p", {"class": "mob-9 text-truncate"})
print(?)
location = shop.find("p", {"class": "mb-2 text-truncate"})
print(?)
I need this output using Python:
'Ram', '010101010', '5th lane, kamathipura, Kamathipura'

Using the tags and class identifiers, you can grab all content within the regions you want. Then, with content indices, you can select the exact pieces you need, like this:
from bs4 import BeautifulSoup

url = 'myhtml.html'
with open(url) as fp:
    soup = BeautifulSoup(fp, 'html.parser')

content1 = [soup.find('a').contents[0].strip()]
content2 = [x.contents[2].strip() for x in soup.find_all("p", "text-truncate")]
print(*(content1 + content2))

Have you tried location.get_text()?
You can read more about it in the BeautifulSoup documentation.
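To tie both answers together, here is a minimal runnable sketch against the question's HTML (a sketch, assuming the snippet is stored in a string named html; the value in each <p> sits in its last bare text node, after the <small> label):

```python
from bs4 import BeautifulSoup

# the question's markup, trimmed to the relevant parts
html = '''
<div class="list-group-item">
  <h2 class="h5 mt-0 text-truncate"><a class="text-warning" href="www.example.com">Ram</a></h2>
  <p class="mob-9 text-truncate"><small><i class="fa fa-fw fa-mobile-alt"></i> Contact:</small> 010101010</p>
  <p class="mb-2 text-truncate"><small><i class="fa fa-fw fa-map-marker-alt"></i> Location:</small> 5th lane, kamathipura, Kamathipura</p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('div', class_='list-group-item')

name = item.h2.get_text(strip=True)
# each <p class="... text-truncate"> ends with a bare text node: the value itself
paragraphs = item.find_all('p', class_='text-truncate')
number = paragraphs[0].contents[-1].strip()
location = paragraphs[1].contents[-1].strip()
print(name, number, location)
```

Calling get_text(strip=True) on the whole <p> would also pull in the "Contact:" label, which is why the last raw text node is used instead.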

Related

How to extract data or a number from a button's data-content="" attribute using Beautiful Soup 4 in Python, as shown below

<button class="btn btn-success phone_number_btn pull-right" data-ad-id="4598214"
data-content="
<div class='text-center owner-info-popover'>
<div class='primary-lang'>Danish</div>
<h4 style='display:inline-block' class='nomargin mt5
generic-green primary-
lang'><i class='fa fa-phone'></i> 03333315122</h4>
<p style='margin:10px -15px 0' class='fs12'>Mention PakWheels.com when calling Seller to get a good deal</p>
</div>
" data-html="true" data-placement="bottom" data-price="4260000" onclick="trackEvents('UsedCars','ContactPhone','Search-Free');
adPhoneCount('/used-cars/4598214/increment_phone_count')" rel="popover"><i class="fa fa-phone"></i>Show Phone No.</button>
data-content is like any other attribute - id, class, name, href, src - so you can get it from the element the same way: item["data-content"]. Then you can use standard string functions or a regex on it. You may also have to replace the entities &lt; and &gt; with < and >, at which point you have another piece of HTML that you can parse with BeautifulSoup again.
html = '''<button class="btn btn-success phone_number_btn pull-right" data-ad-id="4598214"
data-content="
<div class='text-center owner-info-popover'>
<div class='primary-lang'>Danish</div>
<h4 style='display:inline-block' class='nomargin mt5
generic-green primary-
lang'><i class='fa fa-phone'></i> 03333315122</h4>
<p style='margin:10px -15px 0' class='fs12'>Mention PakWheels.com when calling Seller to get a good deal</p>
</div>
" data-html="true" data-placement="bottom" data-price="4260000" onclick="trackEvents('UsedCars','ContactPhone','Search-Free');
adPhoneCount('/used-cars/4598214/increment_phone_count')" rel="popover"><i class="fa fa-phone"></i>Show Phone No.</button>'''
from bs4 import BeautifulSoup as BS
soup = BS(html, 'html.parser')
item = soup.find('button')
text = item['data-content']
#print(text)
soup = BS(text, 'html.parser')
item = soup.find('h4')
print(item.get_text(strip=True))
Result:
03333315122
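If only the number is needed, a regex over the decoded attribute is a lighter alternative (a sketch; it assumes the phone number is the only run of ten or more digits inside data-content):

```python
import re
from bs4 import BeautifulSoup

# trimmed version of the button; html.parser decodes the &lt;/&gt; entities for us
html = '<button data-content="&lt;h4&gt;&lt;i&gt;&lt;/i&gt; 03333315122&lt;/h4&gt;">Show Phone No.</button>'

soup = BeautifulSoup(html, 'html.parser')
text = soup.find('button')['data-content']   # '<h4><i></i> 03333315122</h4>'
number = re.search(r'\d{10,}', text).group()
print(number)   # 03333315122
```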

Python BeautifulSoup No Output

I'm testing out BeautifulSoup in Python, and I only see '[]' when I run this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
# print(week)
print(week.find_all('li'))
Any help would be appreciated. Thank you!
There are no li elements, as you can see from week's content:
<div class="sevenDay" id="seven-day-periods">
<!-- Legend: show only when data is loaded -->
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Feels like</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Night</div>
</div>
<div class="wxRow wx_detailed-metrics nighttime">
<div class="legendColumn">Day</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">POP</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind ()</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind gust ()</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Hrs of Sun</div>
</div>
<div class="top-divider2"> </div>
</div>
<div class="divTableBody">
</div>
</div>
You may have mixed it up with how it is displayed in the browser. I believe what you are trying to obtain is the values inside the legend columns. This can be obtained using:
for x in week.find_all('div', 'legendColumn'):
    print(x.findAll(text=True))
Thus your code will now be:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
for x in week.find_all('div', 'legendColumn'):
    print(x.findAll(text=True))
Where the output is
['Feels like']
['Night']
['Day']
['POP']
['Wind ()']
['Wind gust ()']
['Hrs of Sun']
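Note the comment in the markup, "Legend: show only when data is loaded", which suggests the forecast values themselves are filled in by the page after load; only the legend labels are present in the fetched HTML. Over a static snippet, the same extraction can be sketched with get_text as well (the snippet below mirrors the structure shown above):

```python
from bs4 import BeautifulSoup

# static snippet mirroring the legend structure from the answer
html = '''
<div class="sevenDay" id="seven-day-periods">
  <div class="wx_legend wx_legend_hidden">
    <div class="wxRow wx_detailed-metrics"><div class="legendColumn">Feels like</div></div>
    <div class="wxRow wx_detailed-metrics daytime"><div class="legendColumn">Night</div></div>
    <div class="wxRow wx_detailed-metrics"><div class="legendColumn">POP</div></div>
  </div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
week = soup.find(id='seven-day-periods')
labels = [x.get_text(strip=True) for x in week.find_all('div', 'legendColumn')]
print(labels)   # ['Feels like', 'Night', 'POP']
```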

How to scrape same class name data

I was trying to scrape some real estate websites, but the one I came across uses the same class name several times under one div, and that div also contains two more divs with the same class name. I want to scrape the child class data (I think).
I want to scrape below class data:
<div class="m-srp-card__summary__info">New Property</div>
Below is the whole block of code I'm trying to scrape:
<div class="m-srp-card__collapse js-collapse" aria-collapsed="collapsed" data-container="srp-card-summary">
<div class="m-srp-card__summary js-collapse__content" data-content="srp-card-summary">
<input type="hidden" id="propertyArea42679361" value="888 sqft">
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">carpet area</div>
<div class="m-srp-card__summary__info">888 sqft</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">status</div>
<div class="m-srp-card__summary__info">Ready to Move</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">floor</div>
<div class="m-srp-card__summary__info">9 out of 13 floors</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">transaction</div>
<div class="m-srp-card__summary__info">New Property</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">furnishing</div>
<div class="m-srp-card__summary__info">Unfurnished</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">facing</div>
<div class="m-srp-card__summary__info">South -West</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">overlooking</div>
<div class="m-srp-card__summary__info">Garden/Park, Main Road</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">society</div>
<div class="m-srp-card__summary__info">
<a id="project-link-42679361" class="m-srp-card__summary__link"
href="https://www.magicbricks.com/skylights-bopal-ahmedabad-pdpid-4d4235303936323633"
target="_blank">Skylights</a>
</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">car parking</div>
<div class="m-srp-card__summary__info">1 Covered</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">bathroom</div>
<div class="m-srp-card__summary__info">3</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">balcony</div>
<div class="m-srp-card__summary__info">2</div>
</div>
<div class="m-srp-card__summary__item">
<div class="m-srp-card__summary__title">ownership</div>
<div class="m-srp-card__summary__info">Co-operative Society</div>
</div>
</div>
<div class="m-srp-card__collapse__control js-collapse__control" data-toggle="list-collapse"
data-target="srp-card-summary" onclick="stopPage=true;">
<div class="ico m-srp-card__ico">
<svg role="icon">
<use xlink:href="#icon-caret-down"></use>
</svg>
</div>
I tried indexing but got nothing.
Below is my code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

req = Request('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad', headers={'User-Agent': 'Mozilla/5.0'})
webpage = urlopen(req).read()
soup = BeautifulSoup(webpage, 'html.parser')
containers = soup.find_all('div', {'class': 'm-srp-card__desc flex__item'})
container = containers[0]
no_apartment = container.find('h3').find('span', {'class': 'm-srp-card__title__bhk'}).getText()
c_area = container.find('div', {'class': 'm-srp-card__summary__info'}).getText()
p_price = container.find('div', {'class': 'm-srp-card__info flex__item'})
p_type = container.find('div', {'class': 'm-srp-card__summary js-collapse__content'}).find_all('div', {'class': 'm-srp-card__summary__info'})[3]
Thanks in advance!
import requests
from bs4 import BeautifulSoup
import csv
import re

r = requests.get('https://www.magicbricks.com/property-for-sale/residential-real-estate?proptype=Multistorey-Apartment,Builder-Floor-Apartment,Penthouse,Studio-Apartment,Residential-House,Villa&Locality=Bopal&cityName=Ahmedabad')
soup = BeautifulSoup(r.text, 'html.parser')
category = []
size = []
price = []
floor = []
for item in soup.findAll('span', {'class': 'm-srp-card__title__bhk'}):
    category.append(item.get_text(strip=True))
for item in soup.findAll(text=re.compile('area$')):
    size.append(item.find_next('div').text)
for item in soup.findAll('span', {'class': 'm-srp-card__price'}):
    price.append(item.text)
for item in soup.findAll(text='floor'):
    floor.append(item.find_next('div').text)
data = []
for items in zip(category, size, price, floor):
    data.append(items)
with open('output.csv', 'w+', newline='', encoding='UTF-8-SIG') as file:
    writer = csv.writer(file)
    writer.writerow(['Category', 'Size', 'Price', 'Floor'])
    writer.writerows(data)
print("Operation Completed")
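An alternative sketch that is more robust to missing fields: pair each m-srp-card__summary__title with the m-srp-card__summary__info inside the same m-srp-card__summary__item, so a card lacking, say, "floor" does not shift the columns (shown over a static snippet of the question's markup):

```python
from bs4 import BeautifulSoup

# static snippet of the question's summary markup
html = '''
<div class="m-srp-card__summary__item">
  <div class="m-srp-card__summary__title">carpet area</div>
  <div class="m-srp-card__summary__info">888 sqft</div>
</div>
<div class="m-srp-card__summary__item">
  <div class="m-srp-card__summary__title">transaction</div>
  <div class="m-srp-card__summary__info">New Property</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
summary = {}
for item in soup.find_all('div', 'm-srp-card__summary__item'):
    title = item.find('div', 'm-srp-card__summary__title').get_text(strip=True)
    info = item.find('div', 'm-srp-card__summary__info').get_text(strip=True)
    summary[title] = info
print(summary)   # {'carpet area': '888 sqft', 'transaction': 'New Property'}
```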

Why is BeautifulSoup extracting unreferenced tags?

<div id="reply" class="reply attachment text">
<p class="intro">
<label for="delete">
<span class="name">Name</span>
</label>
<span class="identification">0123456789</span>
</p>
</div>
With the above html I want to get the id number '0123456789'.
To get the id I tried:
ids = soup.findAll(lambda tag: tag.name == 'span' and tag.findParent('p', 'intro') and tag.findParent('p', 'intro').findParent('div', class_=re.compile("(.)*attachment(.)*$")))
and
ids = soup.findAll(lambda tag: tag.name == 'div' and tag.findChild('p', 'intro') and tag.findChild('p', 'intro').findChild('span', class_='poster_id'))
but every time I get (with .get_text()):
#by John Smith
#0123456789
'recursive=False' gives no output
What am I doing wrong?
from bs4 import BeautifulSoup
html = '''
<div id="reply" class="reply attachment text">
<p class="intro">
<label for="delete">
<span class="name">Name</span>
</label>
<span class="identification">0123456789</span>
</p>
</div>
'''
soup = BeautifulSoup(html,'lxml')
content = soup.find_all('span', class_ = 'identification')
print(content[0].get_text())
# output: 0123456789
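The parent constraints from the lambda attempts can also be written as a CSS selector, which is usually easier to read (a sketch; select() handles the compound "reply attachment text" class and the nesting in one expression):

```python
from bs4 import BeautifulSoup

html = '''
<div id="reply" class="reply attachment text">
  <p class="intro">
    <label for="delete"><span class="name">Name</span></label>
    <span class="identification">0123456789</span>
  </p>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# div.attachment matches any div whose class list contains "attachment"
ids = soup.select('div.attachment p.intro span.identification')
print(ids[0].get_text())   # 0123456789
```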

Scrapy conditional crawling

My HTML contains a number of divs with mostly similar structure. Below is an excerpt containing two such divs:
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
Here is what I want Scrapy to do:
If the div with class="outer-container" contains another div with title="verified", like in the 1st div above, it should go to the URL above it (i.e. www.xxxxxx.com) and fetch some other fields on that page.
If there is no div containing title="verified", like in the 2nd div above, it should fetch all the data under div class="mody" - i.e. company name (Fat Dude, LLC), address, city, state, etc. - and NOT follow the URL (i.e. www.yyyyy.com).
So how do I apply this condition/logic in the Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure.
What I have tried so far:
class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop="Address")
        loclity = soup.find_all("span", itemprop="Locality")
        region = soup.find_all("span", itemprop="Region")
        post = soup.find_all("span", itemprop="postalCode")
        nf['companyName'] = cName[0]['content']
        nf['address'] = addrs[0]['content']
        nf['locality'] = loclity[0]['content']
        nf['state'] = region[0]['content']
        nf['zipcode'] = post[0]['content']
        yield nf
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)
Of course, the above code crawls all the URLs under div class="inner-container", as there is no conditional crawling specified - because I don't know where/how to set it.
If anyone has done something similar before, please share. Thanks!
No need to use BeautifulSoup; Scrapy comes with its own selector capabilities (also released separately as parsel). Let's use your HTML to make an example:
html = u"""
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
"""
from parsel import Selector

sel = Selector(text=html)
for div in sel.css('.outer-container'):
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print('verified, follow this URL:', url)
    else:
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print('not verified, extracted item is:', nf)
The result for the previous snippet is:
verified, follow this URL: /c/my-llc
not verified, extracted item is: {'companyName': 'Fat Dude, LLC', 'address': '7890 Business St', 'locality': 'Tokyo', 'state': 'MA', 'zipcode': '987655'}
But in Scrapy you don't even need to instantiate the Selector class, there is a shortcut available in the response object passed to callbacks. Also, you shouldn't be subclassing CrawlSpider, just the regular Spider class is enough. Putting it all together:
from scrapy import Spider, Request
from myproject.items import NewsFields

class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        for div in response.selector.css('.outer-container'):
            if div.css('div[title="verified"]'):
                url = div.css('a::attr(href)').extract_first()
                yield Request(url)
            else:
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf
I would suggest you get familiar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html
Happy scraping!
