Python - Scrapy Splash can't render this page

https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=08/16/2018
This is the page that I'm trying to scrape. When I use SplashRequest to open it, I get a different page with the same source.
These are my Splash settings:
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
My spider code:
import scrapy
from scrapy_splash import SplashRequest

class RealForeclosure(scrapy.Spider):
    name = 'realForeclosure'
    start_urls = [
        'https://www.miamidade.realforeclose.com/index.cfm?zaction=user&zmethod=calendar'
    ]

    def parse(self, response):
        link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
        date = response.xpath('//div[@tabindex="0"]/@dayid').extract()[10]
        yield SplashRequest(link + date, callback=self.auction)

    def auction(self, response):
        for i in response.css('.AUCTION_ITEM').extract():
            yield {'item': i}

You need some kind of delay to allow Splash to render the result:
script1 = """
function main(splash, args)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
"""
def parse(self, response):
    link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
    date = response.xpath('//div[@tabindex="0"]/@dayid').extract()[10]
    yield SplashRequest(
        link + date,
        callback=self.auction,
        endpoint='execute',
        args={
            'html': 1,
            'lua_source': self.script1,
            'wait': 0.5,
        }
    )
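If you don't need the screenshot or HAR data, the same delay can be requested without a Lua script. A minimal sketch, assuming the default render.html endpoint (its 'wait' argument is part of the Splash HTTP API):
def parse(self, response):
    link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
    date = response.xpath('//div[@tabindex="0"]/@dayid').extract()[10]
    # 'wait' gives the page's JavaScript time to build the .AUCTION_ITEM nodes;
    # 2 seconds is a guess, tune it to the page.
    yield SplashRequest(link + date, callback=self.auction, args={'wait': 2})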

Related

Why Splash response is not readable after 200 response?

I have a JS-rendered website, and while scraping I get a 200 response, as follows:
<200 https://www.sokmarket.com.tr/mevsim-sebzeleri-c-1821>
Here is my request with Splash:
def start_requests(self):
    lua_script = '''
    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(5))
        return {
            html = splash:html(),
            url = splash:url(),
        }
    end
    '''
    urls = Findurl(self.name)
    for url in urls:
        request = SplashRequest(url, self.parse,
            args={"lua_source": lua_script}
        )
        yield request
After creating the request, I print the body like this:
def parse(self, response):
    print(response)
    print(response.text)
And here is the unreadable output:
<html><head></head><body>�X�r����R�Z�e��$�S;�y�&{��&yP�� ... [several more lines of undecoded binary bytes omitted] ...</body></html>
I run Splash with Docker Compose and all the tests pass. I added the following lines to my settings:
SPIDER_MIDDLEWARES = {
    'crawler.crawler.middlewares.FirstBotSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
SPLASH_URL = 'http://localhost:8050'
I also tested with Selenium, and it crawls the page in readable form. But I need to crawl with Splash.
I tried with a proxy, but the result is the same. I am new to Splash; what am I missing?
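The output looks like a compressed (gzip/deflate/brotli) body that was never decoded. One thing that stands out against the other configurations on this page: the DOWNLOADER_MIDDLEWARES above omits the HttpCompressionMiddleware entry that the scrapy-splash setup instructions include. A hedged sketch of the missing piece:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # Re-registered at 810 so decompression runs after the Splash middleware,
    # per the scrapy-splash setup instructions:
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}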

scrapy-playwright: Downloader/handlers: scrapy.exceptions.NotSupported: AsyncioSelectorReactor

I tried to extract some data from a dynamically loaded JavaScript website using scrapy-playwright, but I am stuck at the very beginning.
The part of my settings.py file where I'm running into trouble is as follows:
#playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
#TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
#ASYNCIO_EVENT_LOOP = 'uvloop.Loop'
When I inject the following scrapy-playwright handler:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
Then I got:
scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The installed reactor
(twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)
When I inject TWISTED_REACTOR:
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Then I got:
raise TypeError(
TypeError: SelectorEventLoop required, instead got: <ProactorEventLoop running=False closed=False debug=False>
Finally, when I inject ASYNCIO_EVENT_LOOP,
Then I got:
ModuleNotFoundError: No module named 'uvloop'
Lastly, installing 'uvloop' fails as well:
pip install uvloop
Script
import scrapy
from scrapy_playwright.page import PageCoroutine

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        pass
        # parses content
It has been suggested by the developers of scrapy-playwright to set the DOWNLOAD_HANDLERS and TWISTED_REACTOR settings directly in your script.
A similar comment is provided here.
Here's a working script implementing just this:
import scrapy
from scrapy_playwright.page import PageCoroutine
from scrapy.crawler import CrawlerProcess

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            callback=self.parse,
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        container = response.xpath("(//div[@class='col-md-6'])[1]")
        for items in container:
            yield {
                'products': items.xpath("(//h3[@class='card-title'])[1]//text()").get()
            }
        # parses content

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI": 'Products.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(ProductSpider)
    process.start()
And we get the following output:
{'products': 'Oxford Loafers'}
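Since CrawlerProcess supplies the settings and drives the crawl itself, this runs directly with python script.py; the TWISTED_REACTOR setting takes effect here because it is applied before Twisted's default reactor gets installed, which is exactly what the project-level settings.py approach was failing to guarantee.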
If you are using Windows, then your problem is that scrapy-playwright doesn't support Windows. Check it out here: https://github.com/scrapy-plugins/scrapy-playwright/issues/154
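As background, the SelectorEventLoop TypeError above appears because Python 3.8+ defaults to ProactorEventLoop on Windows, while Twisted's AsyncioSelectorReactor requires a selector-based loop. A hedged sketch of the usual policy override, placed before Scrapy/Twisted start up; note this satisfies the reactor check, but per the issue linked above scrapy-playwright itself may still not work on Windows:
import sys
import asyncio

if sys.platform == "win32":
    # AsyncioSelectorReactor needs a selector loop; Python 3.8+ on Windows
    # defaults to ProactorEventLoop, hence the TypeError above.
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())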

Scrapy - Splash fetch dynamic data

I am trying to fetch a dynamic phone number from this page (among others): https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html
The phone number appears after a click on the div element with the class page-action click-tel. I am trying to get to this data with scrapy_splash, using a Lua script to execute the click.
After pulling Splash on my Ubuntu machine:
sudo docker run -d -p 8050:8050 scrapinghub/splash
Here is my code so far (I am using a proxy service):
import scrapy
from bs4 import BeautifulSoup

class company(scrapy.Spider):
    name = "company"
    custom_settings = {
        "FEEDS": {
            '/home/ubuntu/scraping/europages/data/company.json': {
                'format': 'jsonlines',
                'encoding': 'utf8'
            }
        },
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPLASH_URL": 'http://127.0.0.1:8050/',
        "SPIDER_MIDDLEWARES": {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage'
    }
    allowed_domains = ['www.europages.fr']

    def __init__(self, company_url):
        self.company_url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"  ##forced
        self.item = company_item()  # company_item comes from elsewhere in the project
        self.script = """
        function main(splash)
            splash.private_mode_enabled = false
            assert(splash:go(splash.args.url))
            assert(splash:wait(0.5))
            local element = splash:select('.page-action.click-tel')
            local bounds = element:bounds()
            element:mouse_click{x=bounds.width/2, y=bounds.height/2}
            splash:wait(4)
            return splash:html()
        end
        """

    def start_requests(self):
        yield scrapy.Request(
            url=self.company_url,
            callback=self.parse,
            dont_filter=True,
            meta={
                'splash': {
                    'endpoint': 'execute',
                    'url': self.company_url,
                    'args': {
                        'lua_source': self.script,
                        'proxy': 'http://username:password@proxyhost:port',
                        'html': 1,
                        'iframes': 1
                    }
                }
            }
        )

    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        print(soup.find('div', {'class': 'page-action click-tel'}))
The problem is that the click has no effect; I still get nothing, as if no button had been clicked.
Shouldn't return splash:html() return the result of element:mouse_click{x=bounds.width/2, y=bounds.height/2} in response.body (since element:mouse_click() waits for the changes to appear)?
Am I missing something here?
Most times when sites load data dynamically, they do so via background XHR requests to the server. A close examination of the network tab when you click the 'telephone' button shows that the browser sends an XHR request to the URL https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330. You can emulate the same in your spider and avoid using Scrapy Splash altogether. See the sample implementation below using one URL:
import scrapy
from urllib.parse import urlparse

class Company(scrapy.Spider):
    name = 'company'
    allowed_domains = ['www.europages.fr']
    start_urls = ['https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html']

    def parse(self, response):
        # obtain the id and uuid needed for the xhr request
        # (str.rstrip strips a character set, not a suffix, so use
        # removesuffix, available in Python 3.9+, to drop '.html' safely)
        uuid = urlparse(response.url).path.split('/')[-1].removesuffix('.html')
        id = response.xpath("//div[@itemprop='telephone']/a/@onclick").re_first(r"event,'(\d+)',")
        yield scrapy.Request(f"https://www.europages.fr/InfosTelecomJson.json?uidsid={uuid}&id={id}", callback=self.parse_address)

    def parse_address(self, response):
        yield response.json()
I get the response
{'digits': '+49 220 69 53 30'}
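To sanity-check that endpoint outside of Scrapy, the same XHR can be replayed with plain requests. A sketch; the uidsid/id values are the ones from the answer above, and the site may additionally require browser-like headers:
import requests

# Replay the XHR fired by the 'telephone' button (values from the answer above).
url = "https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330"
print(requests.get(url).json())  # expected shape: {'digits': '...'}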

Scrapy Splash click on link with javascript href

I am using Scrapy Splash to scrape a page that has an element like this:
Page 1 of 349 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Next ›
I want to 'click' the anchor with the text 'Next' and have the JavaScript execute to fetch the next page.
This is what my scraper looks like:
script = """
function main(splash)
    splash:init_cookies(splash.args.cookies)
    assert(splash:go(splash.args.url))
    splash:wait(0.5)
    local element = splash:select('div.result-content-columns div.result-title')
    local bounds = element:bounds()
    element:mouse_click{x=bounds.width/2, y=bounds.height/2}
    return {
        cookies = splash:get_cookies(),
        html = splash:html()
    }
end
"""
class MySpider(scrapy.Spider):
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPLASH_URL': 'http://192.168.59.103:8050',
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
    }

    def start_requests(self):
        yield SplashRequest(url=some_url, meta={'cookiejar': 1},
                            callback=self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='render.html',
                            args={'wait': 5},
                            )

    def parse(self, response):
        self.extract_data_from_page(response)
        href = response.xpath('//div[@class="paging"]/p/a[contains(text(),"Next")]/@href')
        if href:
            new_url = href.extract_first()
            yield SplashRequest(new_url, self.parse,
                                cookies={'store_language': 'en'},
                                endpoint='execute', args={'lua_source': script})
The Lua script is incorrect (I copied it from an unrelated example). My question is: how do I pass the required args to the Lua script so that the JavaScript is run?
You can pass additional arguments (docs) to the Lua script by adding the values to the SplashRequest's args:
javascript = "doSubmit('frmRow',1,0)"
yield SplashRequest(new_url, self.parse,
                    cookies={'store_language': 'en'},
                    endpoint='execute',
                    args={'lua_source': self.script, 'javascript': javascript})
Inside the Lua script you can get the value from args and execute the JavaScript with runjs:
function main(splash, args)
    -- ...
    -- Get the argument here:
    local javascript = args.javascript
    -- Run the JS:
    assert(splash:runjs(javascript))
    return {
        html = splash:html()
    }
end
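For reference, here is a sketch of the two pieces combined into one complete script; the extra wait after runjs is an assumption to give the next page time to load, so tune it to the site:
function main(splash, args)
    splash:init_cookies(args.cookies)
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    -- Execute the JavaScript passed in via the SplashRequest args:
    assert(splash:runjs(args.javascript))
    -- Allow time for the next page's content to load (assumed value):
    assert(splash:wait(2))
    return {
        cookies = splash:get_cookies(),
        html = splash:html()
    }
end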

Scrapy: How to get cookies from splash

I am trying to get the cookies from a Splash request, but I keep getting an error.
Here is the code I am using:
class P2PEye(scrapy.Spider):
    name = 'p2peyeSpider'
    allowed_domains = ['p2peye.com']
    start_urls = ['https://www.p2peye.com/platform/h9/']

    def start_requests(self):
        script = '''
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(0.5))
            return {
                cookies = splash:get_cookies(),
            }
        end
        '''
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, endpoint='render.html', args={'wait': 1, 'lua_source': script})

    def parse(self, response):
        print(response.request.headers.getlist('Set-Cookie'))
        print(response.cookiejar)
This is my settings.py:
SPLASH_URL = 'http://127.0.0.1:8050'
CRAWLERA_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
COOKIES_ENABLED = True
COOKIES_DEBUG = True
SPLASH_COOKIES_DEBUG = True
The result of response.request.headers.getlist('Set-Cookie') is [], and response.cookiejar raises an error: AttributeError: 'SplashTextResponse' object has no attribute 'cookiejar'.
So how can I get the cookies without causing an error?
To access response.cookiejar you need to get a SplashJsonResponse back.
Try returning extra fields from your Lua script:
script = '''
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
'''
Using the Lua script below, the response will be a dict with the cookies located at the key cookies:
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        cookies = splash:get_cookies(),
    }
end
So to access them you should use:
# d = requests.post('splash').json()
print(d['cookies'])
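When the script is run through scrapy-splash's execute endpoint instead of a raw POST, returning a Lua table makes the response a SplashJsonResponse, which exposes the decoded table as response.data. A minimal sketch of the callback:
def parse(self, response):
    # response.data holds the table returned by the Lua script
    cookies = response.data['cookies']
    print(cookies)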
