I am trying to fetch a dynamically loaded phone number from this page (among others): https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html
The phone number appears after a click on the div element with the class page-action click-tel. I am trying to get this data with scrapy_splash, using a Lua script to execute the click.
After pulling and running Splash on my Ubuntu machine:
sudo docker run -d -p 8050:8050 scrapinghub/splash
Here is my code so far (I am using a proxy service):
import scrapy
from bs4 import BeautifulSoup
# company_item is the item class defined elsewhere in my project

class company(scrapy.Spider):
    name = "company"
    custom_settings = {
        "FEEDS": {
            '/home/ubuntu/scraping/europages/data/company.json': {
                'format': 'jsonlines',
                'encoding': 'utf8'
            }
        },
        "DOWNLOADER_MIDDLEWARES": {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        "SPLASH_URL": 'http://127.0.0.1:8050/',
        "SPIDER_MIDDLEWARES": {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        "DUPEFILTER_CLASS": 'scrapy_splash.SplashAwareDupeFilter',
        "HTTPCACHE_STORAGE": 'scrapy_splash.SplashAwareFSCacheStorage'
    }
    allowed_domains = ['www.europages.fr']

    def __init__(self, company_url):
        self.company_url = "https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html"  ## forced
        self.item = company_item()
        self.script = """
        function main(splash)
            splash.private_mode_enabled = false
            assert(splash:go(splash.args.url))
            assert(splash:wait(0.5))
            local element = splash:select('.page-action.click-tel')
            local bounds = element:bounds()
            element:mouse_click{x=bounds.width/2, y=bounds.height/2}
            splash:wait(4)
            return splash:html()
        end
        """

    def start_requests(self):
        yield scrapy.Request(
            url=self.company_url,
            callback=self.parse,
            dont_filter=True,
            meta={
                'splash': {
                    'endpoint': 'execute',
                    'url': self.company_url,
                    'args': {
                        'lua_source': self.script,
                        'proxy': 'http://usernamepassword@proxyhost:port',
                        'html': 1,
                        'iframes': 1
                    }
                }
            }
        )

    def parse(self, response):
        soup = BeautifulSoup(response.body, "lxml")
        print(soup.find('div', {'class': 'page-action click-tel'}))
The problem is that this has no effect: I still get nothing, as if the button had never been clicked.
Shouldn't return splash:html() reflect the result of element:mouse_click{x=bounds.width/2, y=bounds.height/2} in response.body (since element:mouse_click() waits for the changes to appear)?
Am I missing something here?
Most of the time, when sites load data dynamically, they do so via background XHR requests to the server. A close examination of the network tab when you click the 'telephone' button shows that the browser sends an XHR request to the URL https://www.europages.fr/InfosTelecomJson.json?uidsid=DEU241700-00101&id=1330. You can emulate the same request in your spider and avoid using scrapy-splash altogether. See the sample implementation below, using one URL:
import scrapy
from urllib.parse import urlparse

class Company(scrapy.Spider):
    name = 'company'
    allowed_domains = ['www.europages.fr']
    start_urls = ['https://www.europages.fr/LEMMERFULLWOOD-GMBH/DEU241700-00101.html']

    def parse(self, response):
        # obtain the uidsid and id needed to make the xhr request
        uuid = urlparse(response.url).path.split('/')[-1].rstrip('.html')
        id = response.xpath("//div[@itemprop='telephone']/a/@onclick").re_first(r"event,'(\d+)',")
        yield scrapy.Request(f"https://www.europages.fr/InfosTelecomJson.json?uidsid={uuid}&id={id}", callback=self.parse_address)

    def parse_address(self, response):
        yield response.json()
I get the following response:
{'digits': '+49 220 69 53 30'}
Related
I have a JS-rendered website, and while scraping it I get a 200 response as follows:
<200 https://www.sokmarket.com.tr/mevsim-sebzeleri-c-1821>
Here is my request with Splash:
def start_requests(self):
    lua_script = '''
        function main(splash)
            local url = splash.args.url
            assert(splash:go(url))
            assert(splash:wait(5))
            return {
                html = splash:html(),
                url = splash:url(),
            }
        end
    '''
    urls = Findurl(self.name)
    for url in urls:
        request = SplashRequest(url, self.parse,
            args={"lua_source": lua_script}
        )
        yield request
After creating the request, I print the body like this:
def parse(self, response):
    print(response)
    print(response.text)
And here is the unreadable output:
<html><head></head><body>�X�r����R�Z�e��$�S;�y�&{��&yP�� �)aL����+�c���_i�/+;U����A_N7u}�Po2��:Mn�ͧ�0���V6E7�)hf�K�r��.tܾ�;��R���Di��ݲV��I�
[... several more lines of undecodable binary data ...]
</body></html>
I run Splash with Docker Compose and all its tests pass. I added the following lines to my settings:
SPIDER_MIDDLEWARES = {
    'crawler.crawler.middlewares.FirstBotSpiderMiddleware': 543,
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

SPLASH_URL = 'http://localhost:8050'
I also tested with Selenium, and it crawls the page readably. But I need to crawl with Splash.
I tried with a proxy, but the result is the same. I am new to Splash and I don't understand what I am missing.
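One thing worth checking (a guess, not a confirmed fix): the scrapy-splash setup instructions also register Scrapy's HttpCompressionMiddleware at order 810 in DOWNLOADER_MIDDLEWARES, since Splash responses may come back compressed, and the settings above omit that entry. The middleware block would then look roughly like this:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    # recommended by the scrapy-splash docs; handles compressed Splash responses
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}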
I tried to extract some data from a dynamically loaded JavaScript website using scrapy-playwright, but I'm stuck at the very beginning.
The part of my settings.py file where I'm facing trouble is as follows:
# playwright
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
# ASYNCIO_EVENT_LOOP = 'uvloop.Loop'
When I add the following scrapy-playwright handler:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
Then I got:
scrapy.exceptions.NotSupported: Unsupported URL scheme 'https': The installed reactor
(twisted.internet.selectreactor.SelectReactor) does not match the requested one (twisted.internet.asyncioreactor.AsyncioSelectorReactor)
When I add TWISTED_REACTOR:
TWISTED_REACTOR = 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
Then I got:
raise TypeError(
TypeError: SelectorEventLoop required, instead got: <ProactorEventLoop running=False closed=False debug=False>
And when I add ASYNCIO_EVENT_LOOP:
Then I got:
ModuleNotFoundError: No module named 'uvloop'
And finally, installing 'uvloop' fails:
pip install uvloop
Script
import scrapy
from scrapy_playwright.page import PageCoroutine

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        # parses content
        pass
The developers of scrapy-playwright have suggested setting DOWNLOAD_HANDLERS and TWISTED_REACTOR directly in your script.
A similar comment is provided here
Here's a working script implementing just this:
import scrapy
from scrapy_playwright.page import PageCoroutine
from scrapy.crawler import CrawlerProcess

class ProductSpider(scrapy.Spider):
    name = 'product'

    def start_requests(self):
        yield scrapy.Request(
            'https://shoppable-campaign-demo.netlify.app/#/',
            callback=self.parse,
            meta={
                'playwright': True,
                'playwright_include_page': True,
                'playwright_page_coroutines': [
                    PageCoroutine("wait_for_selector", "div#productListing"),
                ]
            }
        )

    async def parse(self, response):
        # parses content
        container = response.xpath("(//div[@class='col-md-6'])[1]")
        for items in container:
            yield {
                'products': items.xpath("(//h3[@class='card-title'])[1]//text()").get()
            }

if __name__ == "__main__":
    process = CrawlerProcess(
        settings={
            "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
            "DOWNLOAD_HANDLERS": {
                "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
                "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            },
            "CONCURRENT_REQUESTS": 32,
            "FEED_URI": 'Products.jl',
            "FEED_FORMAT": 'jsonlines',
        }
    )
    process.crawl(ProductSpider)
    process.start()
And we get the following output:
{'products': 'Oxford Loafers'}
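As a side note beyond the answer above (hedged, since it depends on the installed version): more recent releases of scrapy-playwright deprecate PageCoroutine in favour of PageMethod and the playwright_page_methods meta key, so on a newer version the request in start_requests would look roughly like this:

# newer scrapy-playwright versions (PageCoroutine is deprecated there)
from scrapy_playwright.page import PageMethod

def start_requests(self):
    yield scrapy.Request(
        'https://shoppable-campaign-demo.netlify.app/#/',
        meta={
            'playwright': True,
            'playwright_page_methods': [
                PageMethod("wait_for_selector", "div#productListing"),
            ],
        },
    )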
If you are using Windows, then your problem is that scrapy-playwright doesn't support Windows. See https://github.com/scrapy-plugins/scrapy-playwright/issues/154
I'm trying to scrape an e-commerce website,
example link: https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1
The data is rendered via React, and when I scrape a few links most of the data is returned as null. When I view the page source I cannot find the actual HTML that is visible via inspect element, just JSON inside script tags. I tested running the scrapy scraper several times on the same links, and data that was not found before is sometimes returned, so it happens somewhat randomly. I cannot figure out how I should scrape this kind of website.
I'm also using a pool of user agents and pauses between requests.
script = '''
    function main(splash, args)
        assert(splash:go(args.url))
        assert(splash:wait(1.5))
        return splash:html()
    end
'''

def start_requests(self):
    url = [
        'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
        'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
        'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
        'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
        'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1'
    ]
    for link in url:
        yield SplashRequest(url=link, callback=self.parse, endpoint='render.html', args={'wait': 0.5, 'lua_source': self.script}, dont_filter=True)

def parse(self, response):
    yield {
        'title': response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
        'price': response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
        'description': response.xpath("//div[@id='module_product_detail']").extract_first()
    }
I tried this:
Pass 'execute' as the endpoint argument of the Splash request instead of 'render.html':
import scrapy
from scrapy_splash import SplashRequest

class DynamicSpider(scrapy.Spider):
    name = 'products'

    url = [
        'https://www.lazada.sg/products/esogoal-tactical-sling-bag-outdoor-chest-pack-shoulder-backpack-military-sport-bag-for-trekking-camping-hiking-rover-sling-daypack-for-men-women-i204814494-s353896924.html?mp=1',
        'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
        'https://www.lazada.sg/products/esogoal-selfie-stick-tripod-extendable-selfie-stick-monopod-with-integrated-tripod-and-bluetooth-remote-shutter-wireless-selfie-stick-tripod-for-cellphonecameras-i205279097-s309050125.html?mp=1',
        'https://www.lazada.sg/products/esogoal-mini-umbrella-travel-umbrella-sun-rain-umbrella8-ribs-98cm-big-surface-lightweight-compact-parasol-uv-protection-for-men-women-i204815487-s308312226.html?mp=1',
        'https://www.lazada.sg/products/esogoal-2-in-1-selfie-stick-tripod-bluetooth-selfie-stand-with-remote-shutter-foldable-tripod-monopod-i279432816-s436738661.html?mp=1',
    ]

    script = """
        function main(splash, args)
            assert(splash:go(args.url))
            assert(splash:wait(1.5))
            return {
                html = splash:html()
            }
        end
    """

    def start_requests(self):
        for link in self.url:
            yield SplashRequest(
                url=link,
                callback=self.parse,
                endpoint='execute',
                args={'wait': 0.5, 'lua_source': self.script},
                dont_filter=True,
            )

    def parse(self, response):
        yield {
            'title': response.xpath("//span[@class='pdp-mod-product-badge-title']/text()").extract_first(),
            'price': response.xpath("//span[contains(@class, 'pdp-price')]/text()").extract_first(),
            'description': response.xpath("//div[@id='module_product_detail']/h2/text()").extract_first()
        }
And this is the result.
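For what it's worth, the reason the switch matters (my own reading, not stated in the original answer) is that the lua_source argument is only executed by Splash's execute and run endpoints; render.html ignores it, so the original requests never ran the script or its wait.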
I'm trying to produce instant reports for some of my webpages using Splash and the Scrapy-Splash Python module.
The problem is that when a website redirects, I can't obtain the correct final URL the way Splash's render.json does.
For example on localhost:8050/render.json the result for rendering www.google.com is:
{"requestedUrl": "http://www.google.com/",
"url": "https://www.google.com/?gws_rd=ssl",
"title": "Google", "geometry": [0, 0, 1024, 768]}
But inside my Python script I only manage to obtain "http://www.google.com".
My code is:
def start_requests(self):
    return [Request(self.url, callback=self.parse, dont_filter=True)]

def parse(self, response):
    splash_args = {'wait': 1}
    return SplashRequest(
        response.url,
        self.parse_link,
        args=splash_args,
        endpoint='render.json',
    )

def parse_link(self, response):
    result = {
        'requested_url': response.data['requestedUrl'],
        'real_url': response.data['url'],
        'response': response.request.url,
        'splash_url': response.real_url
    }
But all of these return:
{"requested_url": "http://www.google.com/",
"real_url": "http://www.google.com/",
"response": "http://127.0.0.1:8050/render.json",
"splash_url": "http://127.0.0.1:8050/render.json"}
Is there any way to overcome this?
I am trying to get the cookies from a splash request, but I keep getting an error.
Here is the code I am using:
import scrapy
from scrapy_splash import SplashRequest

class P2PEye(scrapy.Spider):
    name = 'p2peyeSpider'
    allowed_domains = ['p2peye.com']
    start_urls = ['https://www.p2peye.com/platform/h9/']

    def start_requests(self):
        script = '''
            function main(splash)
                local url = splash.args.url
                assert(splash:go(url))
                assert(splash:wait(0.5))
                return {
                    cookies = splash:get_cookies(),
                }
            end
        '''
        for url in self.start_urls:
            yield SplashRequest(url, callback=self.parse, endpoint='render.html', args={'wait': 1, 'lua_source': script})

    def parse(self, response):
        print(response.request.headers.getlist('Set-Cookie'))
        print(response.cookiejar)
This is my settings.py
SPLASH_URL = 'http://127.0.0.1:8050'
CRAWLERA_ENABLED = False

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {'scrapy_splash.SplashDeduplicateArgsMiddleware': 100 }
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
COOKIES_ENABLED = True
COOKIES_DEBUG = True
SPLASH_COOKIES_DEBUG = True
The result of response.request.headers.getlist('Set-Cookie') is [],
and response.cookiejar raises an error: AttributeError: 'SplashTextResponse' object has no attribute 'cookiejar'.
So how can I get the cookies without causing an error?
To access response.cookiejar you need the response to be a SplashJsonResponse.
Try returning extra fields from your Lua script:
script = '''
    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(0.5))
        local entries = splash:history()
        local last_response = entries[#entries].response
        return {
            url = splash:url(),
            headers = last_response.headers,
            http_status = last_response.status,
            cookies = splash:get_cookies(),
            html = splash:html(),
        }
    end
'''
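Note that for the returned table to reach the spider, the request also has to go through the execute endpoint rather than render.html (render.html does not run lua_source); with execute, the spider receives a SplashJsonResponse. A minimal sketch of the request side, assuming the script above is stored in script:

yield SplashRequest(
    url,
    callback=self.parse,
    endpoint='execute',
    args={'wait': 1, 'lua_source': script},
)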
Using the Lua script below, the response will be a dict with the cookies located at the key cookies:
function main(splash)
    local url = splash.args.url
    assert(splash:go(url))
    assert(splash:wait(0.5))
    return {
        cookies = splash:get_cookies(),
    }
end
So to access them you would use:
# d = requests.post('splash').json()
print(d['cookies'])
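Inside a scrapy-splash spider, the same dictionary is exposed as response.data, so the cookies can be read directly in the callback (a small sketch, assuming the execute endpoint and a script that returns cookies):

def parse(self, response):
    # response is a SplashJsonResponse; the table returned by the
    # Lua script is available as the decoded dict response.data
    cookies = response.data['cookies']
    print(cookies)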