Scrapy Splash click on link with javascript href - python

I am using Scrapy Splash to scrape a page that has an element like this:
Page 1 of 349 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Next ›
I want to 'click' the anchor with the text 'Next', and have the javascript execute to fetch the next page.
This is what my scraper looks like:
script = """
function main(splash)
splash:init_cookies(splash.args.cookies)
assert(splash:go(splash.args.url))
splash:wait(0.5)
local element = splash:select('div.result-content-columns div.result-title')
local bounds = element:bounds()
element:mouse_click{x=bounds.width/2, y=bounds.height/2}
return {
cookies = splash:get_cookies(),
html = splash:html()
}
end
"""
class MySpider(scrapy.Spider):

    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
            'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPLASH_URL': 'http://192.168.59.103:8050',
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
    }
    def start_requests(self):
        yield SplashRequest(url=some_url, meta={'cookiejar': 1},
                            callback=self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='render.html',
                            args={'wait': 5},
                            )

    def parse(self, response):
        self.extract_data_from_page(response)
        href = response.xpath('//div[@class="paging"]/p/a[contains(text(),"Next")]/@href')
        if href:
            new_url = href.extract_first()
            yield SplashRequest(new_url, self.parse,
                                cookies={'store_language': 'en'},
                                endpoint='execute', args={'lua_source': self.script})
The Lua script is incorrect (I copied it from an unrelated example). My question is: how do I pass the required args to the Lua script so that the JavaScript is run?

You can pass additional arguments (docs) to the Lua script by adding the values to the SplashRequest's args:
javascript = "doSubmit('frmRow',1,0)"
yield SplashRequest(new_url, self.parse,
cookies={'store_language':'en'},
endpoint='execute',
args={'lua_source': self.script, 'javascript': javascript})
Inside the Lua script you can read the value from args and execute the JavaScript with splash:runjs:
function main(splash, args)
    -- ...
    -- Get the argument here:
    local javascript = args.javascript
    -- Run the JS:
    assert(splash:runjs(javascript))
    return {
        html = splash:html()
    }
end
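Putting the two together, the spider's parse callback can pull the inline JavaScript out of the 'Next' anchor's href and forward it to the Lua script. This is only a minimal sketch: the paging XPath and the doSubmit(...) call come from the question, and the handling of a javascript: prefix in the href is an assumption about how the anchor is written.

def parse(self, response):
    self.extract_data_from_page(response)
    # e.g. href == "javascript:doSubmit('frmRow',1,0)"
    href = response.xpath('//div[@class="paging"]/p/a[contains(text(),"Next")]/@href').extract_first()
    if href:
        javascript = href.replace('javascript:', '')  # keep only the JS call itself
        # stay on the current page; the forwarded JS triggers the navigation inside Splash
        yield SplashRequest(response.url, self.parse,
                            cookies={'store_language': 'en'},
                            endpoint='execute',
                            args={'lua_source': self.script, 'javascript': javascript})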

Related

Why Splash response is not readable after 200 response?

I have a JS-rendered website, and while scraping it I get a 200 response, as follows:
<200 https://www.sokmarket.com.tr/mevsim-sebzeleri-c-1821>
Here is my request with Splash:
def start_requests(self):
    lua_script = '''
    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(5))
        return {
            html = splash:html(),
            url = splash:url(),
        }
    end
    '''
    urls = Findurl(self.name)
    for url in urls:
        request = SplashRequest(url, self.parse,
                                args={"lua_source": lua_script}
                                )
        yield request
After creating the request, I print the body like this:
def parse(self, response):
    print(response)
    print(response.text)
And here is the unreadable output:
<html><head></head><body>�X�r����R�Z�e��$�S;�y�&{��&yP�� … (several more lines of unreadable binary/compressed bytes) …</body></html>
I run Splash with Docker Compose and all the tests pass. I added the following lines to my settings:
SPIDER_MIDDLEWARES = {
    'crawler.crawler.middlewares.FirstBotSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
SPLASH_URL = 'http://localhost:8050'
I also tested with Selenium, and it crawls the page in readable form, but I need to crawl with Splash. I tried with a proxy and the result is the same. I am new to Splash; what am I missing?
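For comparison, the scrapy-splash setup in the first question on this page also registers Scrapy's HttpCompressionMiddleware after the Splash middlewares, which this settings block omits. Below is a minimal sketch of that ordering; whether the missing entry explains the compressed-looking body here is only an assumption, not something confirmed in this question.

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}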

Scrapy - Splash fetch dynamic data

I am trying to fetch a dynamically loaded phone number from this page (among others): https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html
The phone number appears after a click on the div element with the class page-action click-tel. I am trying to get this data with scrapy_splash, using a Lua script to execute the click.
After pulling Splash on my Ubuntu machine:
sudo docker run -d -p 8050:8050 scrapinghub/splash
Here is my code so far (I am using a proxy service):
import scrapy
from bs4 import BeautifulSoup
from scrapy_splash import SplashRequest

class company(scrapy.Spider):
    name = "company"
    custom_settings = {
        "FEEDS": {
            '/home/ubuntu/scraping/europages/data/company.json': {
                'format': 'jsonlines',
                'encoding': 'utf8'
            }
        },
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPLASH_URL": 'http://127.0.0.1:8050/',
        "SPIDER_MIDDLEWARES": {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage'
    }
    allowed_domains = ['www.europages.fr']

    def __init__(self, company_url):
        self.company_url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"  ## forced
        self.item = company_item()
        self.script = """
        function main(splash)
            splash.private_mode_enabled = false
            assert(splash:go(splash.args.url))
            assert(splash:wait(0.5))
            local element = splash:select('.page-action.click-tel')
            local bounds = element:bounds()
            element:mouse_click{x=bounds.width/2, y=bounds.height/2}
            splash:wait(4)
            return splash:html()
        end
        """

    def start_requests(self):
        yield scrapy.Request(
            url=self.company_url,
            callback=self.parse,
            dont_filter=True,
            meta={
                'splash': {
                    'endpoint': 'execute',
                    'url': self.company_url,
                    'args': {
                        'lua_source': self.script,
                        'proxy': 'http://usernamepassword@proxyhost:port',
                        'html': 1,
                        'iframes': 1
                    }
                }
            }
        )

    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        print(soup.find('div', {'class': 'page-action click-tel'}))
The problem is that it has no effect; I still get nothing, as if no button were clicked.
Shouldn't return splash:html() include the result of element:mouse_click{x=bounds.width/2, y=bounds.height/2} in response.body, since element:mouse_click() waits for the changes to appear?
Am I missing something here?
Most of the time, when sites load data dynamically, they do so via background XHR requests to the server. A close examination of the network tab when you click the 'telephone' button shows that the browser sends an XHR request to the URL https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330. You can emulate the same request in your spider and avoid using Scrapy Splash altogether. See the sample implementation below using one URL:
import scrapy
from urllib.parse import urlparse

class Company(scrapy.Spider):
    name = 'company'
    allowed_domains = ['www.europages.fr']
    start_urls = ['https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html']

    def parse(self, response):
        # obtain the id and uuid to make the xhr request
        uuid = urlparse(response.url).path.split('/')[-1].rstrip('.html')
        id = response.xpath("//div[@itemprop='telephone']/a/@onclick").re_first(r"event,'(\d+)',")
        yield scrapy.Request(f"https://www.europages.fr/InfosTelecomJson.json?uidsid={uuid}&id={id}", callback=self.parse_address)

    def parse_address(self, response):
        yield response.json()
I get the response:
{'digits': '+49 220 69 53 30'}
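To sanity-check the endpoint outside Scrapy, a one-off request reproduces the same JSON. This is a minimal sketch, assuming the requests library is installed; the URL and expected result are the ones quoted above.

import requests

url = "https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330"
print(requests.get(url).json())  # expected: {'digits': '+49 220 69 53 30'}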

Scrapy-Splash Requested URL vs Real URL

I'm trying to make instantaneous reports for some of my webpages using Splash and the scrapy-splash Python module.
The problem is that I can't obtain the correct final URL, like the one in Splash's render.json, when a website redirects.
For example on localhost:8050/render.json the result for rendering www.google.com is:
{"requestedUrl": "http://www.google.com/",
"url": "https://www.google.com/?gws_rd=ssl",
"title": "Google", "geometry": [0, 0, 1024, 768]}
But inside my Python script I only manage to obtain "http://www.google.com".
My code is:
def start_requests(self):
    return [Request(self.url, callback=self.parse, dont_filter=True)]

def parse(self, response):
    splash_args = {'wait': 1}
    return SplashRequest(
        response.url,
        self.parse_link,
        args=splash_args,
        endpoint='render.json',
    )

def parse_link(self, response):
    result = {
        'requested_url': response.data['requestedUrl'],
        'real_url': response.data['url'],
        'response': response.request.url,
        'splash_url': response.real_url
    }
But all of these return:
{"requested_url": "http://www.google.com/",
"real_url": "http://www.google.com/",
"response": "http://127.0.0.1:8050/render.json",
"splash_url": "http://127.0.0.1:8050/render.json"}
Is there any way to overcome this?

Python - Scrapy splash can't render this page

https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE=08/16/2018
This is the page that I'm trying to scrape. When I use SplashRequest to open it, I get a different page with the same source.
These are my settings for Splash:
ROBOTSTXT_OBEY = False
SPLASH_URL = 'http://192.168.99.100:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
My spider code:
import scrapy
from scrapy_splash import SplashRequest

class RealForeclosure(scrapy.Spider):
    name = 'realForeclosure'
    start_urls = [
        'https://www.miamidade.realforeclose.com/index.cfm?zaction=user&zmethod=calendar'
    ]

    def parse(self, response):
        link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
        date = response.xpath('//div[@tabindex="0"]/@dayid').extract()[10]
        yield SplashRequest(link + date, callback=self.auction)

    def auction(self, response):
        for i in response.css('.AUCTION_ITEM').extract():
            yield {'item': i}
You need some kind of delay to allow Splash to render the result:
script1 = """
function main(splash, args)
assert (splash:go(args.url))
assert (splash:wait(0.5))
return {
html = splash: html(),
png = splash:png(),
har = splash:har(),
}
end
"""
def parse(self,response):
link = 'https://www.miamidade.realforeclose.com/index.cfm?zaction=AUCTION&Zmethod=PREVIEW&AUCTIONDATE='
date = response.xpath('//div[#tabindex="0"]/#dayid').extract()[10]
yield SplashRequest(
link+date,
callback=self.auction,
endpoint='execute',
args={
'html': 1,
'lua_source': self.script1,
'wait': 0.5,
}
)
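If the page only needs time to render and no interaction is required, the same delay can also be achieved without a Lua script, since SplashRequest's default render.html endpoint accepts the wait argument directly. A minimal sketch, keeping the rest of the spider unchanged:

yield SplashRequest(
    link + date,
    callback=self.auction,
    args={'wait': 0.5},  # default render.html endpoint: wait before returning the rendered HTML
)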

Scrapy: How to get cookies from splash

I am trying to get the cookies from a splash request, but I keep getting an error.
Here is the code I am using:
class P2PEye(scrapy.Spider):
    name = 'p2peyeSpider'
    allowed_domains = ['p2peye.com']
    start_urls = ['https://www.p2peye.com/platform/h9/']

    def start_requests(self):
        script = '''
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(0.5))
            return {
                cookies = splash:get_cookies(),
            }
        end
        '''
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, endpoint='render.html', args={'wait': 1, 'lua_source': script})

    def parse(self, response):
        print(response.request.headers.getlist('Set-Cookie'))
        print(response.cookiejar)
This is my settings.py:
SPLASH_URL = 'http://127.0.0.1:8050'
CRAWLERA_ENABLED = False
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
COOKIES_ENABLED = True
COOKIES_DEBUG = True
SPLASH_COOKIES_DEBUG = True
The result of response.request.headers.getlist('Set-Cookie') is [],
and response.cookiejar raises an error: AttributeError: 'SplashTextResponse' object has no attribute 'cookiejar'.
So how can I get the cookies without causing an error?
To access response.cookiejar you need the response to be a SplashJsonResponse.
Try returning extra fields from your Lua script:
script = '''
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    local entries = splash:history()
    local last_response = entries[#entries].response
    return {
        url = splash:url(),
        headers = last_response.headers,
        http_status = last_response.status,
        cookies = splash:get_cookies(),
        html = splash:html(),
    }
end
'''
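Note that lua_source only takes effect with the execute (or run) endpoint, not render.html. With that endpoint the callback receives a SplashJsonResponse, so the cookies can be read from the returned data. A minimal sketch of the spider side, assuming script holds the Lua source above:

def start_requests(self):
    for url in self.start_urls:
        yield SplashRequest(url, callback=self.parse, endpoint='execute',
                            args={'wait': 1, 'lua_source': script})

def parse(self, response):
    # response.data holds the table returned by the Lua script
    print(response.data['cookies'])  # list of cookie dicts from splash:get_cookies()
    print(response.cookiejar)        # available once the response is a SplashJsonResponse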
Using the Lua script below, the response will be a dict with the cookies located under the cookies key:
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        cookies = splash:get_cookies(),
    }
end
So to access them you would use:
# d = requests.post('splash').json()
print(d['cookies'])
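Spelling out that hint: here is a minimal sketch of calling the Splash execute endpoint directly with the requests library, assuming the Splash instance at http://127.0.0.1:8050 and the target URL from the question.

import requests

lua = '''
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return { cookies = splash:get_cookies() }
end
'''

# POST the script and its arguments to Splash's HTTP API and read the JSON result
d = requests.post('http://127.0.0.1:8050/execute',
                  json={'lua_source': lua, 'url': 'https://www.p2peye.com/platform/h9/'}).json()
print(d['cookies'])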
