Scrapy: select HTML elements that have a specific attribute name - python

There is this HTML:
<div>
<div data-id="1"> </div>
<div data-id="2"> </div>
<div data-id="3"> </div>
...
<div> </div>
</div>
I need to select only the inner divs that have the data-id attribute (regardless of its value). How do I achieve that with Scrapy?

You can use the following:
response.css('div[data-id]').extract()
It will give you a list of all divs with the data-id attribute:
[u'<div data-id="1"> </div>',
u'<div data-id="2"> </div>',
u'<div data-id="3"> </div>']
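If you prefer XPath, the equivalent attribute-presence test (a quick sketch against the same response) is:
response.xpath('//div[@data-id]').extract()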

Use BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup("""<div> <div data-id="1"> </div> <div data-id="2"> </div> <div data-id="3"> </div><div> </div> </div>""", "html.parser")
print(soup.find_all("div", {"data-id": True}))
OUTPUT:
[<div data-id="1"> </div>, <div data-id="2"> </div>, <div data-id="3"> </div>]
To require that an attribute be present in find or find_all, pass the attribute with the value True.
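For example, the same True-valued filter works with find (a minimal sketch against the soup above), returning just the first match:
print(soup.find("div", {"data-id": True}))
# <div data-id="1"> </div>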

<li class="gb_i" aria-grabbed="false">
<a class="gb_d" data-pid="192" draggable="false" href="xyz.com" id="gb192">
<div data-class="gb_u"></div>
<div data-class="gb_v"></div>
<div data-class="gb_w"></div>
<div data-class="gb_x"></div>
</a>
</li>
Take a look at the example HTML code above.
To get all the divs that have a data-class attribute, in Scrapy v1.6+:
response.xpath('//a[@data-pid="192"]/div[@data-class]').getall()
In Scrapy versions before 1.6, use extract() in place of getall().
Hope this helps

scrapy shell
In [1]: b = '''
...: <div>
...: <div data-id="1">gdfg </div>
...: <div data-id="2">dgdfg </div>
...: <div data-id="3">asdasd </div>
...: <div> </div>
...: </div>
...: '''
In [2]: from scrapy import Selector
In [3]: sel = Selector(text=b, type="html")
In [4]: sel.xpath(r'//div[re:test(@data-id, "\d")]/text()').extract()
Out[4]: ['gdfg ', 'dgdfg ', 'asdasd ']
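If you only need attribute presence rather than a value pattern, plain XPath works and the EXSLT re:test() call can be dropped (same sel as above):
In [5]: sel.xpath('//div[@data-id]/text()').extract()
Out[5]: ['gdfg ', 'dgdfg ', 'asdasd ']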

Related

BeautifulSoup: get elements with the same class

I'm having trouble parsing HTML elements with the "class" attribute using BeautifulSoup.
The HTML code is like this:
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
I need to get the data (XPANDER 1.5L GLX, MT, 1499, Gasoline).
I tried detail.find(class_='item-content'), but it only gets XPANDER 1.5L GLX.
Please help.
Use .find_all() or .select():
from bs4 import BeautifulSoup
html_doc = """
<div class="info-item">
<div class="item-name">Model</div>
<div class="item-content">XPANDER 1.5L GLX</div>
</div>
<div class="info-item">
<div class="item-name">Transmission</div>
<div class="item-content"> MT </div>
</div>
<div class="info-item">
<div class="item-name">Engine Capacity (cc)</div>
<div class="item-content">1499 cc</div>
</div>
<div class="info-item">
<div class="item-name">Fuel</div>
<div class="item-content">Bensin </div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
items = [
    item.get_text(strip=True) for item in soup.find_all(class_="item-content")
]
print(*items)
Prints:
XPANDER 1.5L GLX MT 1499 cc Bensin
Or:
items = [item.get_text(strip=True) for item in soup.select(".item-content")]
You can try this:
soup = BeautifulSoup(html_doc, "html.parser")
items = [item.text for item in soup.find_all("div", {"class": "item-content"})]
find_all retrieves all occurrences, while find returns only the first match.
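If you need the labels too, here is a small sketch (reusing the soup parsed from html_doc above) that pairs each item-name with its item-content:
pairs = {
    item.find(class_="item-name").get_text(strip=True):
        item.find(class_="item-content").get_text(strip=True)
    for item in soup.find_all(class_="info-item")
}
print(pairs)
# {'Model': 'XPANDER 1.5L GLX', 'Transmission': 'MT', 'Engine Capacity (cc)': '1499 cc', 'Fuel': 'Bensin'}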

Parsing all elements which have a given tag before them

I have following html code:
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
I'm trying to display only the text inside the rows whose fieldset has the legend BBB (in this example: bbb, bbb, bbb).
Currently I've created the code below, but it doesn't look pretty, and I don't know how to find all rows:
bs = BeautifulSoup(request.txt, 'html.parser')
if bs.find('legend', text='BBB'):
    value = bs.find('legend').next_element.next_element.next_element.get_text().strip()
    print(value)
Is there any simply way to do this? div class name is the same, just "legend" is variable.
I added a <legend>CCC</legend> so you can see that it scales.
html = """<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>CCC</legend>
<div class="row">ccc</div>
<div class="row">ccc</div>
<div class="row">ccc</div>
...
</fieldset>
</div>"""
from bs4 import BeautifulSoup

bs = BeautifulSoup(html, "html.parser")
after_tag = bs.find("legend", text="BBB").parent  # Grabs the parent <fieldset>.
divs = after_tag.find_all("div", {"class": "row"})  # Finds all divs inside the parent.
for div in divs:
    print(div.text)
bbb
bbb
bbb
from bs4 import BeautifulSoup
html = """
<div class="1">
<fieldset>
<legend>AAA</legend>
<div class="row">aaa</div>
<div class="row">aaa</div>
<div class="row">aaa</div>
...
</fieldset>
</div>
<div class="1">
<fieldset>
<legend>BBB</legend>
<div class="row">bbb</div>
<div class="row">bbb</div>
<div class="row">bbb</div>
...
</fieldset>
</div>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div > fieldset')[1]
tuple_obj = ()
for row in elements.select('div.row'):
    tuple_obj = tuple_obj + (row.text,)
print(tuple_obj)
The tuple object prints out:
('bbb', 'bbb', 'bbb')
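If you would rather select by the legend text than by position, here is a sketch assuming bs4 4.7+ (whose soupsieve backend supports :has() and :-soup-contains()):
rows = soup.select('fieldset:has(> legend:-soup-contains("BBB")) div.row')
print(tuple(row.text for row in rows))
# ('bbb', 'bbb', 'bbb')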

Python BeautifulSoup No Output

I'm testing out BeautifulSoup in Python, and I only get [] when I run this code:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
# print(week)
print(week.find_all('li'))
Any help would be appreciated. Thank you!
There are no li elements, as you can see from week's content:
<div class="sevenDay" id="seven-day-periods">
<!-- Legend: show only when data is loaded -->
<div class="wx_legend wx_legend_hidden">
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Feels like</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Night</div>
</div>
<div class="wxRow wx_detailed-metrics nighttime">
<div class="legendColumn">Day</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">POP</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind ()</div>
</div>
<div class="wxRow wx_detailed-metrics">
<div class="legendColumn">Wind gust ()</div>
</div>
<div class="wxRow wx_detailed-metrics daytime">
<div class="legendColumn">Hrs of Sun</div>
</div>
<div class="top-divider2"> </div>
</div>
<div class="divTableBody">
</div>
You may have mixed it up with how the page is displayed in the browser. I believe what you are trying to obtain is the values inside the legend columns, which you can get with:
for x in week.find_all('div', 'legendColumn'):
    print(x.findAll(text=True))
Your code then becomes:
import requests
from bs4 import BeautifulSoup
page = requests.get('https://www.theweathernetwork.com/ca/weather/british-columbia/vancouver')
soup = BeautifulSoup(page.content, 'html.parser')
week = soup.find(id='seven-day-periods')
for x in week.find_all('div', 'legendColumn'):
    print(x.findAll(text=True))
Where the output is
['Feels like']
['Night']
['Day']
['POP']
['Wind ()']
['Wind gust ()']
['Hrs of Sun']
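Note that the second positional argument to find_all is a CSS class filter, so the loop above is shorthand for:
for x in week.find_all('div', class_='legendColumn'):
    print(x.findAll(text=True))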

lxml xpath() not returning what I expect

I am following a guide (http://docs.python-guide.org/en/latest/scenarios/scrape/) to scrape a website (https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/). I have gone through the lxml package website and can't figure out what's going wrong.
I have this code:
from lxml import html
import requests
page = requests.get('https://www.brookfieldproperties.com/portfolio/toronto/bay-adelaide-east/')
tree = html.fromstring(page.content)
floor = tree.xpath('//div[@class="column floor"]/text()')
sf = tree.xpath('//div[@class="column rsf"]/text()')
but floor and sf return a list of '\n\t\t\t\t' values, not the values you'd expect from looking at the HTML of the actual website ("20" and "5117" in the case below):
<div class="availabilityWrap">
<h3>Availabilities</h3>
<div class="availabilityRow headerRow">
<div class="column floor">
<a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
</div>
<div class="column rsf">
<p><b>5117</b></p>
</div>
<div class="column divisible">
<p><b>yes</b></p>
</div>
<div class="column date">
<p><b>05/01/2017</b></p>
</div>
<div class="column space">
<p><b>Office</b></p>
</div>
<div class="column description">
<p><b>model suite</b></p>
</div>
<div class="column rent">
<p><b>$26.55</b></p>
</div>
</div>
Shouldn't it just be returning all text in the "column floor" div class? Any help would be great.
floor = tree.xpath('normalize-space(//div[@class="column floor"])')
The div contains \n\t characters for the newlines and indentation; those are text nodes too. You can concatenate all the text and collapse the whitespace by using the normalize-space() function:
In [14]: '''<div class="column floor">
...:
...: <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"
...: target='blank'><img src="/static/images/pdf.png" class="floorPDF" />20</a>
...:
...: </div>'''
Out[14]: '<div class="column floor">\n\n <a href="/media/img/asset/pdf/BAC-ET-_20th_Floor_-_5100sf.pdf"\n target=\'blank\'><img src="/static/images/pdf.png" class="floorPDF" />20</a>\n\n </div>'
EDIT:
for div in tree.xpath('//div[@class="column floor"]'):
    print(div.xpath('normalize-space(.)'))  # `.` means the current node
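Alternatively, you could collect every descendant text node and strip the whitespace on the Python side (a sketch using the same tree):
floors = [t.strip() for t in tree.xpath('//div[@class="column floor"]//text()') if t.strip()]
print(floors)  # ['20'] for the row shown above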

Scrapy conditional crawling

My HTML contains a number of divs with mostly similar structure... below is an excerpt containing 2 such divs:
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
So here is what I want Scrapy to do ...
If the div with class="outer-container" contains another div with title="verified", like in the 1st div above, it should go to the URL above it (i.e. www.xxxxxx.com) and fetch some other fields on that page.
If there is no div containing title="verified", like in the 2nd div above, it should fetch all the data under div class="mody", i.e. company name (Fat Dude, LLC), address, city, state, etc., and NOT follow the URL (i.e. www.yyyyy.com).
So how do I apply this condition/logic in the Scrapy crawler? I was thinking of using BeautifulSoup, but I'm not sure...
Here is what I have tried so far:
class MySpider(CrawlSpider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        cName = soup.find_all("a", class_="mheading primary h4")
        addrs = soup.find_all("span", itemprop_="Address")
        loclity = soup.find_all("span", itemprop_="Locality")
        region = soup.find_all("span", itemprop_="Region")
        post = soup.find_all("span", itemprop_="postalCode")
        nf['companyName'] = cName[0]['content']
        nf['address'] = addrs[0]['content']
        nf['locality'] = loclity[0]['content']
        nf['state'] = region[0]['content']
        nf['zipcode'] = post[0]['content']
        yield nf
        for url in hxs.xpath('//div[@class="inner-container"]/a/@href').extract():
            yield Request(url, callback=self.parse)
Of course, the above code crawls all the URLs under div class="inner-container", as there is no conditional crawling specified, because I don't know where/how to set it.
If anyone has done something similar before, please share. Thanks.
No need to use BeautifulSoup; Scrapy comes with its own selector capabilities (also released separately as parsel). Let's use your HTML to make an example:
html = u"""
<!-- 1st Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="abc xyz" title="verified"></div>
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Top Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">1223 Industrial Blvd</span><br>
<span itemprop="Locality">Paris</span>, <span itemprop="Region">BA</span> <span itemprop="postalCode">123345</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
<!-- 2nd Div start -->
<div class="outer-container">
<div class="inner-container">
<div class="mody">
<div class="row">
<div class="col-md-5 col-xs-12">
<h2><a class="mheading primary h4" href="/c/my-llc"><strong>Fat Dude, LLC</strong></a></h2>
<div class="mvsdfm casmhrn" itemprop="address">
<span itemprop="Address">7890 Business St</span><br>
<span itemprop="Locality">Tokyo</span>, <span itemprop="Region">MA</span> <span itemprop="postalCode">987655</span>
</div>
<div class="hidden-device-xs" itemprop="phone" rel="mainPhone">
(800) 845-0000
</div>
</div>
</div>
</div>
</div>
</div>
"""
from parsel import Selector
sel = Selector(text=html)
for div in sel.css('.outer-container'):
    if div.css('div[title="verified"]'):
        url = div.css('a::attr(href)').extract_first()
        print('verified, follow this URL:', url)
    else:
        nf = {}
        nf['companyName'] = div.xpath('string(.//h2)').extract_first()
        nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
        nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
        nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
        nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
        print('not verified, extracted item is:', nf)
The result for the previous snippet is:
verified, follow this URL: /c/my-llc
not verified, extracted item is: {'companyName': 'Fat Dude, LLC', 'address': '7890 Business St', 'locality': 'Tokyo', 'state': 'MA', 'zipcode': '987655'}
But in Scrapy you don't even need to instantiate the Selector class; there is a shortcut available in the response object passed to callbacks. Also, you shouldn't be subclassing CrawlSpider; the regular Spider class is enough. Putting it all together:
from scrapy import Spider, Request
from myproject.items import NewsFields
class MySpider(Spider):
    name = 'dknfetch'
    start_urls = ['http://www.xxxxxx.com/scrapy/all-listing']
    allowed_domains = ['www.xxxxx.com']

    def parse(self, response):
        for div in response.selector.css('.outer-container'):
            if div.css('div[title="verified"]'):
                url = div.css('a::attr(href)').extract_first()
                yield Request(url)
            else:
                nf = NewsFields()
                nf['companyName'] = div.xpath('string(.//h2)').extract_first()
                nf['address'] = div.css('span[itemprop="Address"]::text').extract_first()
                nf['locality'] = div.css('span[itemprop="Locality"]::text').extract_first()
                nf['state'] = div.css('span[itemprop="Region"]::text').extract_first()
                nf['zipcode'] = div.css('span[itemprop="postalCode"]::text').extract_first()
                yield nf
I would suggest you get familiar with Parsel's API: https://parsel.readthedocs.io/en/latest/usage.html
Happy scraping!
