I just started crawling. I'm trying to crawl questions and answers from http://www.indiabix.com/verbal-ability/spotting-errors/ using the Scrapy framework and Python 2.7. If you view the page source, you'll notice that the answer for every question should be inside the b tag, but it isn't:
<div class="div-spacer">
<p><span class="ib-green"><b>Answer:</b></span> Option <b class="jq-hdnakqb"></b></p>
<p><span class="ib-green"><b>Explanation:</b></span></p>
<p> No answer description available for this question. <b>Let us discuss</b>. </p>
If I inspect the element in the browser, I can see the correct answer as text between the tags (Answer: Option A, B, etc.) for each question, but the raw HTML source doesn't contain it.
To get the text within the b tag I've tried around 15 XPath queries.
I've written the 4-5 most likely ones as comments in the code below.
import scrapy
import urllib
import json
from errors1.items import Errors1Item

class Errors1Spider(scrapy.Spider):
    name = "errors1"
    start_urls = ["http://www.indiabix.com/verbal-ability/spotting-errors/"]

    def parse(self, response):
        i = 0
        y = 0
        j = json.loads(json.dumps(response.xpath('//td[contains(@id, "tdOption")]/text()').extract()))
        x = json.loads(json.dumps(response.xpath('//div[@class="div-spacer"]/p[3]/text()').extract()))

        #to get correct answer
        #response.xpath('//div[@class = "div-spacer"]/p/b/text()').extract()
        #response.xpath('//div[@class = "div-spacer"]/p[1]/b/text()').extract()
        #response.xpath('//div[@class = "div-spacer"]/p//text()').extract()
        #response.xpath('//b[@class = "jq-hdnakqb"]/text()').extract()
        #response.xpath('string(//div[@class = "div-spacer"]/p/b/text())').extract()

        while i < len(j) and y < len(x):
            item = Errors1Item()
            item['optionA'] = j[i]
            i += 1
            item['optionB'] = j[i]
            i += 1
            item['optionC'] = j[i]
            i += 1
            item['optionD'] = j[i]
            i += 1
            item['explanation'] = x[y]
            y += 1
            yield item
Can someone please help me get the answer content from that webpage?
Thanks
From what I understand, there is JavaScript logic involved in setting the correct option value.
What helped me solve it is the scrapyjs middleware, which uses Splash, a browser as a service. Skipping the installation and configuration, here is the spider that I've executed:
# -*- coding: utf-8 -*-
import scrapy

class IndiaBixSpider(scrapy.Spider):
    name = "indiabix"
    allowed_domains = ["www.indiabix.com"]
    start_urls = ["http://www.indiabix.com/verbal-ability/spotting-errors/"]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    def parse(self, response):
        for question in response.css("div.bix-div-container"):
            answer = question.xpath(".//input[starts-with(@id, 'hdnAnswer')]/@value").extract()
            print(answer)
And here is what I've got on the console (correct answers):
[u'A']
[u'C']
[u'A']
[u'C']
[u'C']
[u'C']
[u'B']
[u'A']
[u'D']
[u'C']
[u'B']
[u'B']
[u'A']
[u'B']
[u'B']
See also:
https://stackoverflow.com/a/30378765/771848
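The installation and configuration were skipped above; purely as a sketch (assuming the scrapy-splash package, the successor to scrapyjs, and a Splash instance already running locally on port 8050; the middleware paths differ if you use the older scrapyjs package), the settings.py entries look roughly like this:
# settings.py -- sketch only, assuming scrapy-splash and a local Splash instance
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'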
I want to get the URL of a video (.mp4) from an iframe using Python (or Rust); it doesn't matter which library. For example, I have:
<iframe src="https://spinning.allohalive.com/?kp=1332827&token=b51bdfc8af17dee996d3eae53726df" />
I really have no idea how to do this. Please help! If you need more information, just ask.
The code that I use to parse iframes from a website:
import scrapy
from cimber.models.website import Website

class KinokradSpider(scrapy.Spider):
    name = "kinokrad"
    start_urls = [Website.Kinokrad.value]

    def __init__(self):
        self.pages_count = 1

    def parse(self, response):
        pages_count = self.get_pages_count(response)
        if self.pages_count <= pages_count:
            for film in response.css("div.shorposterbox"):
                film_url = film.css("div.postertitle").css("a").attrib["href"]
                yield scrapy.Request(film_url, callback=self.parse_film)
            next_page = f"{Website.Kinokrad.value}/page/{self.pages_count}"
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)
                self.pages_count += 1

    def parse_film(self, response):
        name = response.css("div.fallsttitle").css("h1::text").get().strip()
        players = []
        for player in response.css("iframe::attr(src)").extract():
            players.append(player)
        yield {
            "name": name,
            "players": players
        }

    def get_pages_count(self, response) -> int:
        links = response.css("div.navcent").css("a")
        last_link = links[len(links) - 1].attrib["href"]
        return int(last_link.split("/page/")[1].replace("/", "").strip())
I've been trying for two weeks, and finally I'm asking this question on Stack Overflow. First I used BeautifulSoup, then Selenium, and now Scrapy. I have a lot of code for automatically parsing iframes, but I need the mp4 URL. I've already tried the solutions on Stack Overflow, but they don't work, so please don't remove my question.
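One rough direction, sketched below with hypothetical names and a placeholder start URL (not a tested solution for that particular host): follow each iframe src with a further request and scan the player page for a direct .mp4 link. Whether this finds anything depends entirely on how the player embeds the video; many players build the URL in JavaScript, which plain Scrapy will not execute.
import re
import scrapy

class PlayerMp4Spider(scrapy.Spider):
    # Hypothetical sketch: the spider name and start URL are placeholders.
    name = "player_mp4"
    start_urls = ["https://example.com/some-film-page"]

    MP4_RE = re.compile(r"https?://[^\s\"']+\.mp4[^\s\"']*")

    def parse(self, response):
        # follow each embedded player page instead of only collecting its src
        for player_url in response.css("iframe::attr(src)").extract():
            yield scrapy.Request(response.urljoin(player_url), callback=self.parse_player)

    def parse_player(self, response):
        # look for direct .mp4 links in the player page's HTML / inline scripts
        yield {"player": response.url, "mp4": self.MP4_RE.findall(response.text)}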
I'm trying to scrape the Internet Archive (Wayback Machine): https://web.archive.org/web/20150906222155mp_/https://www.zalando.co.uk/womens-clothing/.
I am successful in scraping the first page's content, but I can't move to the next page. I have tried multiple XPath expressions to get to the next pages:
# 1
next_page_url = response.xpath("//li[a[contains(.,'>')]]//@href").extract_first()  # does not work
# 2
next_page_url = response.xpath("(//a[@class='catalogPagination_page' and text()='>'])[1]//@href").get()  # does not work
I have tried converting to an absolute URL (and without), but again with no luck.
Can anyone help with new XPath or CSS selectors so that I can finally scrape the next pages?
Below you can see my full code:
# -*- coding: utf-8 -*-
import scrapy

class ZalandoWomenSpider(scrapy.Spider):
    name = 'zalando_women_historic_2015'
    allowed_domains = ['www.web.archive.org']
    start_urls = ['https://web.archive.org/web/20150906222155mp_/https://www.zalando.co.uk/womens-clothing/']

    def parse(self, response):
        products = response.xpath("//a[@class='catalogArticlesList_productBox']")
        for product in products:
            link = product.xpath(".//@href").get()
            absolute_url = f"https://web.archive.org{link}"
            yield scrapy.Request(url=absolute_url, callback=self.parse_product, dont_filter=True, meta={'link': link})

        # process next page
        next_page_url = response.xpath("//li[a[contains(.,'>')]]//@href").extract_first()  # (//a[@class='catalogPagination_page' and text()='>'])[1]//@href
        absolute_next_page_url = f"https://web.archive.org{next_page_url}"
        #absolute_next_page_url = next_page_url
        #absolute_next_page_url = response.urljoin(next_page_url)
        if next_page_url:
            yield scrapy.Request(url=absolute_next_page_url, callback=self.parse)

    def parse_product(self, response):
        link = response.request.meta['link']
        brand = response.xpath("//span[@itemprop='brand']/text()").get()
        price = response.xpath("//span[@class='price oldPrice nowrap']/text()").get()
        price1 = response.xpath("//span[@itemprop='price']/text()").get()
        price2 = response.xpath("//div[@class='boxPrice']//span[contains(@class,'price')]/text()").get()
        disc_price = response.xpath("//span[@class='price specialPrice nowrap']/text()").get()
        product_type = response.xpath("//span[@itemprop='name']/text()").get()
        material = response.xpath("//div[@class='content']//li[contains(.,'material')]/text()").get()
        yield {
            'brand_name': brand,
            'product_price': price,
            'product_price1': price1,
            'product_price2': price2,
            'product_price_b4_disc': disc_price,
            'link': link,
            'product_type': product_type,
            'material': material}
next_page_url = response.xpath(".//a[@class='catalogPagination_page' and text()='>']/@href").get()
This will get: '/web/20150906222155/https://www.zalando.co.uk/womens-clothing/?p=2'
You can then use split("/") to remove the "/web/201509..." bit if you need the original URL.
Note 1: I used double quotes (" ") around the expression inside the parentheses, since the XPath itself contains single quotes.
Note 2: in Scrapy you can also use response.follow to save having to join a relative URL to a base URL.
Check this post as well:
Scrapy response.follow query
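Putting the two notes together, the end of parse could look roughly like this (a sketch only, reusing the selector from above and letting response.follow resolve the relative Wayback link against the current response URL):
        # process next page
        next_page_url = response.xpath("//a[@class='catalogPagination_page' and text()='>']/@href").get()
        if next_page_url:
            yield response.follow(next_page_url, callback=self.parse)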
I built my first Scrapy spider over several hours during the last two days, but I am stuck right now. The main thing I want to achieve is to extract all the data so I can filter it later in a CSV. But the data that is really crucial for me (companies without(!) webpages) is dropped, because Scrapy can't find the XPath I provided when an item has no homepage. I tried an if statement here, but it's not working.
Example website: https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/Unternehmen?view=publish&item=company&id=1345
I use the XPath selector: response.xpath("//div[@class='cCore_contactInformationBlockWithIcon cCore_wwwIcon']/a/@href").extract()
Example without a website: https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/Unternehmen?view=publish&item=company&id=1512
Spider Code:
# -*- coding: utf-8 -*-
import scrapy

class AchernSpider(scrapy.Spider):
    name = 'achern'
    allowed_domains = ['www.achern.de']
    start_urls = ['https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/']

    def parse(self, response):
        for href in response.xpath("//ul[@class='cCore_list cCore_customList']/li[*][*]/a/@href"):
            url = response.urljoin(href.extract())
            yield scrapy.Request(url, callback=self.scrape)

    def scrape(self, response):
        # Extracting the content using css selectors
        print("Processing:" + response.url)
        firma = response.css('div>#cMpu_publish_company>h2.cCore_headline::text').extract()
        anschrift = response.xpath("//div[contains(@class,'cCore_addressBlock_address')]/text()").extract()
        tel = response.xpath("//div[@class='cCore_contactInformationBlockWithIcon cCore_phoneIcon']/text()").extract()
        mail = response.xpath(".//div[@class='cCore_contactInformationBlock']//*[contains(text(), '@')]/text()").extract()
        web1 = response.xpath("//div[@class='cCore_contactInformationBlockWithIcon cCore_wwwIcon']/a/@href").extract()
        if "http:" not in web1:
            web = "na"
        else:
            web = web1
        row_data = zip(firma, anschrift, tel, mail, web1)  # web1 must be changed to web, but then it only gives out "n" for every link

        # Give the extracted content row wise
        for item in row_data:
            # create a dictionary to store the scraped info
            scraped_info = {
                'Firma': item[0],
                'Anschrift': item[1] + ' 77855 Achern',
                'Telefon': item[2],
                'Mail': item[3],
                'Web': item[4],
            }
            # yield or give the scraped info to scrapy
            yield scraped_info
So overall it should export the dropped items even when "web" is not there.
Hope someone can help. Greetings, S
Using
response.css(".cCore_wwwIcon > a::attr(href)").get()
gives you either None or the website address; you can then use or to provide a default:
website = response.css(".cCore_wwwIcon > a::attr(href)").get() or 'na'
Also, I refactored your scraper to use css selectors. Note that I've used .get() instead of .extract() to get a single item, not a list, which cleans up the code quite a bit.
import scrapy
from scrapy.crawler import CrawlerProcess

class AchernSpider(scrapy.Spider):
    name = 'achern'
    allowed_domains = ['www.achern.de']
    start_urls = ['https://www.achern.de/de/Wirtschaft/Unternehmen-A-Z/']

    def parse(self, response):
        for url in response.css("[class*=cCore_listRow] > a::attr(href)").extract():
            yield scrapy.Request(url, callback=self.scrape)

    def scrape(self, response):
        # Extracting the content using css selectors
        firma = response.css('.cCore_headline::text').get()
        anschrift = response.css('.cCore_addressBlock_address::text').get()
        tel = response.css(".cCore_phoneIcon::text").get()
        mail = response.css("[href^=mailto]::attr(href)").get().replace('mailto:', '')
        website = response.css(".cCore_wwwIcon > a::attr(href)").get() or 'na'

        scraped_info = {
            'Firma': firma,
            'Anschrift': anschrift + ' 77855 Achern',
            'Telefon': tel,
            'Mail': mail,
            'Web': website,
        }
        yield scraped_info

if __name__ == "__main__":
    p = CrawlerProcess()
    p.crawl(AchernSpider)
    p.start()
output:
with website:
{'Firma': 'Wölfinger Fahrschule GmbH', 'Anschrift': 'Güterhallenstraße 8 77855 Achern', 'Telefon': '07841 6738132', 'Mail': 'info@woelfinger-fahrschule.de', 'Web': 'http://www.woelfinger-fahrschule.de'}
without website:
{'Firma': 'Zappenduster-RC Steffen Liepe', 'Anschrift': 'Am Kirchweg 16 77855 Achern', 'Telefon': '07841 6844700', 'Mail': 'Zappenduster-Rc@hotmail.de', 'Web': 'na'}
I am trying to scrape the book title, price, and author from vitalsource.com.
I successfully extracted the title, author, and ISBN information, but I can't get the price from the webpage.
I don't understand why I can't get the data, since it's all on the same webpage.
I googled and tried for many hours, and now it's 4:43 am here; I am tired and desperate, please help me.
Please check the image for more detail. The XPath works fine in the blue area, but not in the red area.
import scrapy
from VitalSource.items import VitalsourceItem
from scrapy.spiders import SitemapSpider

class VsSpider(scrapy.Spider):
    name = 'VS'
    allowed_domains = ['VitalSource.com']
    start_urls = ['https://www.vitalsource.com/products/cengage-unlimited-1st-edition-instant-access-1-cengage-unlimited-v9780357700006']

    def parse(self, response):
        item = VitalsourceItem()
        item['Ebook_Title'] = response.xpath('//*[@id="content"]/div[1]/div[1]/div[1]/div/div[2]/h1/text()').extract()[1].strip()
        item['Ebook_Author'] = response.xpath('//*[@id="content"]/div[1]/div[1]/div[1]/div/div[2]/p/text()').extract()[0].strip()
        item['Ebook_ISBN'] = response.xpath('//*[@id="content"]/div[1]/div[1]/div[1]/div/div[2]/ul/li[2]/h2/text()').extract()[0].strip()
        item['Ebook_Price'] = response.xpath('//*[@id="content"]/div[1]/div[1]/div[1]/div/div[2]/div/span[1]/span[3]/span[2]/text()')
        print(item)
        return item
Result Information:
{
'Ebook_Author': 'by: Cengage Unlimited',
'Ebook_ISBN': 'Print ISBN: \n 9780357700037, 0357700031',
'Ebook_Price': [],
'Ebook_Title': 'Cengage Unlimited, 1st Edition [Instant Access], 1 term (4 months)'
}
I am not sure if you strictly want to use XPath, but I will post how it's done with both an XPath and a CSS selector:
css:
response.css('.u-pull-sixth--right+ span::text').get().strip()
xpath:
response.xpath('/html[1]/body[1]/div[2]/main[1]/div[1]/div[1]/div[1]/div[1]/div[2]/div[1]/span[1]').xpath('//span[#class]//span[2]/text()').get().strip()
Result:
{'Ebook_Price': '119.99'}
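To fold that back into the original spider, the price field could be set like this (a sketch only; it guards against a missing match so .strip() isn't called on None):
        # inside parse(), alongside the other item fields
        price_text = response.css('.u-pull-sixth--right+ span::text').get()
        item['Ebook_Price'] = price_text.strip() if price_text else None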
I'm new to Scrapy and I'm struggling a little with a special case.
Here is the scenario:
I want to scrape a website where there is a list of books.
httpx://...bookshop.../archive is the page where the first 10 books are listed.
Then I want to get the information (name, date, author) for every book in the list. I have to go to another page for each book:
httpx://...bookshop.../book/{random_string}
So there are two types of request:
One for refreshing the list of books.
Another one for getting a book's information.
But new books can be added to the list at any time.
So I would like to refresh the list every minute,
and I also want to delay all requests by 5 seconds.
Here is my basic solution, but it only works for one "loop":
First I set the delay in settings.py:
DOWNLOAD_DELAY = 5
Then the code of my spider:
import time

import scrapy
from scrapy.loader import ItemLoader

class bookshopScraper(scrapy.Spider):
    name = "bookshop"

    url = "httpx://...bookshop.../archive"
    history = []
    last_refresh = 0

    def start_requests(self):
        self.last_refresh = time.time()
        yield scrapy.Request(url=self.url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[3]
        if page == 'archive':
            return self.parse_archive(response)
        else:
            return self.parse_book(response)

    def parse_archive(self, response):
        links = response.css('SOME CSS ').extract()
        for link in links:
            if link not in self.history:
                self.history.append(link)
                yield scrapy.Request(url="httpx://...bookshop.../book/" + link, callback=self.parse)
        if len(self.history) > 10:
            n = len(self.history) - 10
            self.history = self.history[-n:]

    def parse_book(self, response):
        """
        Load Item
        """
Now I would like to do something like:
if time.time() > self.last_refresh + 80:
    self.last_refresh = time.time()
    return scrapy.Request(url=self.url, callback=self.parse, dont_filter=True)
But I really don't know how to implement this.
PS: I want the same Scrapy instance to run all the time without stopping.
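One way to wire up that idea, sketched below using the question's own names (self.url, self.last_refresh): do the time check in parse_book and yield a fresh, non-deduplicated request for the archive page, so the spider keeps re-scheduling the list as long as book pages are still being parsed.
    def parse_book(self, response):
        # ... load and yield the book item here ...

        # roughly every minute, re-schedule the archive page;
        # dont_filter=True stops the dupefilter from dropping the repeated URL
        if time.time() > self.last_refresh + 60:
            self.last_refresh = time.time()
            yield scrapy.Request(url=self.url, callback=self.parse, dont_filter=True)
Note that this only fires while book pages are being parsed; if the request queue ever empties completely the spider will close, and hooking the spider_idle signal is the usual way to keep scheduling new requests in that case.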