I am following a guide (http://docs.python-guide.org/en/latest/scenarios/scrape/) to scrape a website (https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/) and have gone through the lxml package website and can't figure out what's going wrong.
I have this code:
from lxml import html
import requests
page = requests.get('https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/')
tree = html.fromstring(page.content)
floor = tree.xpath('//div[@class="column floor"]/text()')
sf = tree.xpath('//div[@class="column rsf"]/text()')
but floor and sf return lists of '\n\t\t\t\t' strings, not the values you'd expect looking at the HTML on the actual website ("20" and "5117" in the case below):
<div class="availabilityWrap">
<h3>Availabilities</h3>
<div class="availabilityRow headerRow">
<div class="column floor">
<a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>
<div class="column rsf">
<p><b>5117</b></p>
</div>
<div class="column divisible">
<p><b>yes</b></p>
</div>
<div class="column date">
<p><b>05/01/2017</b></p>
</div>
<div class="column space">
<p><b>Office</b></p>
</div>
<div class="column description">
<p><b>model suite</b></p>
</div>
<div class="column rent">
<p><b>$26.55</b></p>
</div>
</div>
Shouldn't it just be returning all text in the "column floor" div class? Any help would be great.
floor = tree.xpath('normalize-space(//div[@class="column floor"])')
The div contains the \n\t characters that create the new lines and indentation; those count as text nodes too. You can concatenate all the text and collapse the whitespace by using the normalize-space() function:
In [14]: '''<div class="column floor">
...:
...: <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
...: target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
...:
...: </div>'''
Out[14]: '<div class="column floor">\n\n <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"\n target=\'blank\'><img src="/static/images/pdf.png" class="floorPDF" />20</a>\n\n </div>'
EDIT:
for div in tree.xpath('//div[@class="column floor"]'):
    print(div.xpath('normalize-space(.)'))  # `.` means the current node
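Putting it together, a minimal sketch (assuming the availability rows on the live page still match the snippet in the question) that pairs each floor with its square footage:

from lxml import html
import requests

page = requests.get('https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/')
tree = html.fromstring(page.content)

# normalize-space() collapses the \n\t padding around each value
floors = [div.xpath('normalize-space(.)')
          for div in tree.xpath('//div[@class="column floor"]')]
sfs = [div.xpath('normalize-space(.)')
      for div in tree.xpath('//div[@class="column rsf"]')]

for floor, sf in zip(floors, sfs):
    print(floor, sf)  # e.g. "20 5117"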
I've got the statement below to check that 2 conditions exist in an element:
if len(driver.find_elements(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")) > 0:
    elem = driver.find_element(By.XPATH, "//span[text()='$400.00']/../following-sibling::div/a[text()='Buy']")
I've tried a few variations, including "preceding-sibling::span[text()='x']", but I can't seem to get the syntax right, or to tell whether I'm going about it the right way.
The HTML is below. The current find_elements(By.XPATH...) call correctly finds the "Total" and "Buy" elements; I would like to also add the $20.00 in the "Price" class as a condition.
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total">
<span>$400.00</span>
</div>
<div class="Buy">
<a class="Button">Buy</a>
</div>
</div>
</li>
</ul>
Using the built-in ElementTree:
import xml.etree.ElementTree as ET
html = '''<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
<div class="List-Content row">
<div class="Price">"$27.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>'''
items = {'Total':'$400.00','Buy':'Buy','Price':'"$20.00"'}
root = ET.fromstring(html)
first_level_divs = root.findall('div')
for first_level_div in first_level_divs:
    results = {}
    for k, v in items.items():
        div = first_level_div.find(f'./div[@class="{k}"]')
        one_level_down = len(list(div)) > 0
        results[k] = list(div)[0].text if one_level_down else div.text
    if results == items:
        print('found')
    else:
        print('not found')
        results = {}
Output:
found
not found
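A design note: xml.etree.ElementTree expects well-formed markup, so this works on the self-contained snippet above; for messier real-world HTML, a forgiving parser such as lxml.html or BeautifulSoup is the safer choice.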
Given this HTML snippet
<ul>
<li class="List">
<div class="List-Content row">
<div class="Price">"$20.00"</div>
<div class="Quantity">10</div>
<div class="Change">0%</div>
<div class="Total"><span>$400.00</span></div>
<div class="Buy"><a class="Button">Buy</a></div>
</div>
</li>
</ul>
I would use this XPath:
buy_buttons = driver.find_elements(By.XPATH, """//div[
    contains(@class, 'List-Content')
    and div[@class = 'Price'] = '$20.00'
    and div[@class = 'Total'] = '$400.00'
]//a[. = 'Buy']""")
for buy_button in buy_buttons:
    print(buy_button)
The for loop replaces your if len(buy_buttons) > 0 check. It won't run when there are no results, so the if is superfluous.
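One caveat: in the snippet above, the Price div literally contains quote characters ("$20.00"), so an exact comparison against '$20.00' may not match. If the real page renders it that way, a contains() check is safer (a sketch under that assumption):

buy_buttons = driver.find_elements(By.XPATH, """//div[
    contains(@class, 'List-Content')
    and contains(div[@class = 'Price'], '$20.00')
    and div[@class = 'Total'] = '$400.00'
]//a[. = 'Buy']""")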
I'm having trouble parsing HTML elements by their "class" attribute using BeautifulSoup.
The HTML code is like this:
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content') but it only gets XPANDER 1.5L GLX.
Please help.
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
    item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this
soup = BeautifulSoup(html, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences.
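If you also need the labels, a small sketch (reusing the html_doc and soup from the first answer) that pairs each item-name with its item-content:

pairs = {
    item.find(class_="item-name").get_text(strip=True):
        item.find(class_="item-content").get_text(strip=True)
    for item in soup.find_all(class_="info-item")
}
print(pairs)
# {'Model': 'XPANDER 1.5L GLX', 'Transmission': 'MT', 'Engine Capacity (cc)': '1499 cc', 'Fuel': 'Bensin'}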
I'm testing out BeautifulSoup with Python, and I only get '[]' (an empty list) when I run this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
# print(week)
print(week.find_all('li'))
Any help would be appreciated. Thank you!
There are no li elements, as you can see from week's content:
<div class="sevenDay" id="seven-day-periods">
<!-- Legend: show only when data is loaded -->
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Feels like</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Night</div>
</div>
<div class="wxRow wx_detailed-metrics nighttime">
<div class="legendColumn">Day</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">POP</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind ()</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind gust ()</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Hrs of Sun</div>
</div>
<div class="top-divider2"> </div>
</div>
<div class="divTableBody">
</div>
You may have mixed it up with how the page looks in the browser: notice that the divTableBody element above is empty in the raw HTML, so the forecast rows are most likely filled in by JavaScript after the page loads. I believe what you are trying to obtain is the values inside the legend columns, which you can get with:
for x in week.find_all('div', 'legendColumn'):
    print(x.findAll(text=True))
Your code then becomes:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
for x in week.find_all('div', 'legendColumn'):
    print(x.findAll(text=True))
Where the output is
['Feels like']
['Night']
['Day']
['POP']
['Wind ()']
['Wind gust ()']
['Hrs of Sun']
There is this HTML:
<div>
<div data-id="1"> </div>
<div data-id="2"> </div>
<div data-id="3"> </div>
...
<div> </div>
</div>
I need to select only the inner divs that have a data-id attribute (regardless of its value). How do I achieve that with Scrapy?
You can use the following
response.css('div[data-id]').extract()
It will give you a list of all divs with a data-id attribute.
['<div data-id="1"> </div>',
 '<div data-id="2"> </div>',
 '<div data-id="3"> </div>']
Use BeautifulSoup:
from bs4 import BeautifulSoup
soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>""", "html.parser")
print(soup.find_all("div", {"data-id":True}))
OUTPUT:
[<div data-id="1"> </div>, <div data-id="2"> </div>, <div data-id="3"> </div>]
You can require an attribute to be present in find or find_all by passing True as its value.
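As a usage sketch, the data-id values themselves can then be read straight off the matched tags:

ids = [div["data-id"] for div in soup.find_all("div", {"data-id": True})]
print(ids)  # ['1', '2', '3']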
<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>
Take a look at the example HTML code above.
To get all divs containing a data-class attribute in Scrapy v1.6+:
response.xpath('//a[@data-pid="192"]/div[@data-class]').getall()
In Scrapy versions <1.6, use extract() in place of getall().
Hope this helps
scrapy shell
In [1]: b = '''
...: <div>
...: <div data-id="1">gdfg </div>
...: <div data-id="2">dgdfg </div>
...: <div data-id="3">asdasd </div>
...: <div> </div>
...: </div>
...: '''
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=b, type="html")
In [4]: sel.xpath(r'//div[re:test(@data-id, "\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']
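For this particular case, a plain attribute test does the same job without the EXSLT regex:

In [5]: sel.xpath('//div[@data-id]/text()').extract()
Out[5]: ['gdfg ', 'dgdfg ', 'asdasd ']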
My HTML contains a number of divs with mostly similar structure; below is an excerpt containing 2 such divs:
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
So here is what I want Scrapy to do:
If the div with class="outer-container" contains another div with title="verified", like in the 1st div above, it should go to the URL inside it (i.e. www.xxxxxx.com) and fetch some other fields on that page.
If there is no div containing title="verified", like in the 2nd div above, it should fetch all the data under div class="mody", i.e. company name (Fat Dude, LLC), address, city, state, etc., and NOT follow the URL (i.e. www.yyyyy.com).
So how do I apply this condition/logic in the Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure...
What I have tried so far:
class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop_="Address")
        loclity = soup.find_all("span", itemprop_="Locality")
        region = soup.find_all("span", itemprop_="Region")
        post = soup.find_all("span", itemprop_="postalCode")
        nf['companyName'] = cName[0]['content']
        nf['address'] = addrs[0]['content']
        nf['locality'] = loclity[0]['content']
        nf['state'] = region[0]['content']
        nf['zipcode'] = post[0]['content']
        yield nf
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)
Of course, the above code returns and crawls all the URLs under the div class="inner-container", as there is no conditional crawling specified in it, because I don't know where/how to set that up.
If anyone has done something similar before, please share. Thanks
No need to use BeautifulSoup here; Scrapy comes with its own selector capabilities (also released separately as parsel). Let's use your HTML to make an example:
html = u"""
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
"""
from parsel import Selector
sel = Selector(text=html)
for div in sel.css('.outer-container'):
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print('verified, follow this URL:', url)
    else:
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print('not verified, extracted item is:', nf)
The output of the previous snippet is:
verified, follow this URL: /c/my-llc
not verified, extracted item is: {'companyName': 'Fat Dude, LLC', 'address': '7890 Business St', 'locality': 'Tokyo', 'state': 'MA', 'zipcode': '987655'}
But in Scrapy you don't even need to instantiate the Selector class: there is a shortcut available on the response object passed to callbacks. Also, you shouldn't be subclassing CrawlSpider; the regular Spider class is enough. Putting it all together:
from scrapy import Spider, Request
from myproject.items import NewsFields

class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        for div in response.selector.css('.outer-container'):
            if div.css('div[title="verified"]'):
                url = div.css('a::attr(href)').extract_first()
                yield Request(url)
            else:
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf
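A side note: on newer Scrapy releases, .get() and .getall() are the preferred spellings of .extract_first() and .extract() (as mentioned for getall() above), e.g.:

nf['address'] = div.css('span[itemprop="Address"]::text').get()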
I would suggest getting familiar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html
Happy scraping!