Scraping AngularJS with Selenium in Python in Chrome headless mode - python

I want to crawl information from a webpage that is built with AngularJS.
My problem is that if I crawl the page in "--headless" mode, I do not receive my target element. Without "--headless" everything works fine.
Can somebody explain, or point me to a link that explains, what the differences are in "--headless" mode?
I read http://allselenium.info/wait-for-elements-python-selenium-webdriver/ . What else could be the matter?
Thank you for any hints.
EDIT:
It also doesn't work with wait conditions in headless mode.

Here is a solution that worked for me after some research and reading:
https://github.com/GoogleChrome/puppeteer/issues/665
https://intoli.com/blog/making-chrome-headless-undetectable/
The headless request is detected by the site, so you have to set arguments that hide the fact that Chrome is running headless:
options.add_argument('--headless')
options.add_argument('--lang=de-DE')
options.add_argument('--user-agent="Mozilla/5.0 (Windows NT 4.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"')
options.add_argument("window-size=1920x1080")

Related

How to hide from the browser that Python Selenium is being used

When I try to use Python Selenium, the browser detects that the page is being accessed by an automated parser and I cannot work with the page. I cannot get the page to load fully in order to call click() and other functions.
Fake-user-agent does not work.
Alternatively, please suggest another method to perform a click on the page, so I can start the video player and get the link from a button located inside this player. I only need this link, which is always different; it is the link to download a video (JW Player 8.26.5).
You would need to bypass the detection by the website that you're using Selenium.
One way is to use the --disable-blink-features=AutomationControlled option when starting the browser. This stops Chrome from exposing the navigator.webdriver flag that normally indicates the browser is being controlled by automation software like Selenium.
Here is an example with Chrome:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
# Hide the navigator.webdriver automation flag
options.add_argument("--disable-blink-features=AutomationControlled")
driver = webdriver.Chrome(options=options)
driver.get("YOUR_URL")
You may also need to set a custom user-agent string to further disguise your browser as a non-automated user:
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.82 Safari/537.36")

How do I know which browser is used to crawl in Scrapy framework?

What's my context:
As you know, the HTML structure of a website can differ between Chrome, Firefox and Safari. So when I use a CSS selector to get data from an element in the HTML structure, sometimes the tag exists in Chrome but not in the other browsers. Because of that, I want to focus on only one browser to reduce my effort.
When I crawl data from URLs using the Scrapy framework, I don't know which browser Scrapy uses to crawl the data, and therefore I don't know what kind of HTML response body will be returned. I checked the response and found that sometimes the structure is the same as what I get from Chrome, but sometimes it is not. It seems as if Scrapy uses many different web browsers to crawl data.
What I want:
I want to use only Chrome browser for crawling data in Scrapy framework
The structure of the HTML response body must be obtained from Chrome
What I ask:
Does anyone have any ideas or tips to help me deal with this issue?
Can I configure the webdriver in the Scrapy framework the way Selenium does? (If it's possible, please show me where and how.)
Thank you!
Scrapy does not use a browser; it parses static HTML, like BeautifulSoup. If you want to parse a dynamic (JavaScript-generated) page, use Selenium, and if you want, you can then send the page source to Scrapy.
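As a rough sketch of what sending the page source to Scrapy can look like (the URL and the CSS selector are placeholders, not from the original answer):
from selenium import webdriver
from scrapy.selector import Selector

driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL

# Let the browser render the JavaScript, then hand the HTML to Scrapy's selector
html = driver.page_source
driver.quit()

sel = Selector(text=html)
titles = sel.css("h1::text").getall()  # placeholder CSS selector
print(titles)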
To make Scrapy use a custom user agent (Chrome), add this to settings.py:
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
or in my_spider.py:
class MySpider(scrapy.Spider):
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse, headers={"User-Agent": "Your Custom User Agent"})
You can set the user agent in your settings file, something like this:
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
To the web server it will then look like the request is coming from Chrome.

Cannot open PhantomJS webpages in desktop mode (always in mobile mode)

I have been trying to fix this issue through Stack Overflow posts, but cannot find any topics relevant to my issue.
I am creating an automated Python script that logs in to my Facebook account and uses some of the features Facebook offers.
When I use Selenium, I usually have the program run in the Chrome browser, using code like the following:
driver = webdriver.Chrome()
and I program the rest of what I want to do from there, since it's easy to see visually what's going on with the program. However, when I switch to the PhantomJS browser, the program loads the mobile version of Facebook (like the Android/iOS version of the site). Here is an example of what it looks like.
I was wondering if anyone could help me understand how to switch this to desktop mode, since the mobile version of Facebook is coded differently than the desktop version, and I don't want to redo the code for this difference. I need this running on PhantomJS because it will run on a low-powered Raspberry Pi that can barely open Google Chrome.
I have also tried the following to see if it worked, and it didn't help.
headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'Cache-Control': 'max-age=0',
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.116 Safari/537.36'
}
driver = webdriver.PhantomJS(desired_capabilities=headers)
driver.set_window_size(1366, 768)
Any help would be greatly appreciated!!
I had the same problem with PhantomJS, Selenium and Python, and the following code resolved it:
from selenium import webdriver
from selenium.webdriver import DesiredCapabilities
desired_capabilities = DesiredCapabilities.PHANTOMJS.copy()
# Present a desktop Chrome user agent so the site serves the desktop layout
desired_capabilities['phantomjs.page.customHeaders.User-Agent'] = (
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/39.0.2171.95 Safari/537.36'
)
driver = webdriver.PhantomJS('./phantom/bin/phantomjs.exe', desired_capabilities=desired_capabilities)
driver.get('http://facebook.com')

Cookies not showing in Scrapy output even though I have enabled them

By default, cookies are enabled in Python Scrapy.
I have this in settings.py:
COOKIES_DEBUG = True
It works in all my other projects and shows cookies in the terminal when I run the code.
But it is not showing the received cookies in the terminal for one specific project.
I have searched the internet but I am not sure what to do.
PS:
The website I am scraping does of course set cookies; I can see them when I visit the site from a browser.
What could I be missing?
From the discussions with the OP, it appears that this website does not send Set-Cookie headers when using Scrapy's default User-Agent string.
Changing the User-Agent string to something like this (in settings.py, for example):
USER_AGENT = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.100 Safari/537.36'
fixes the issue.
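As a quick way to verify this (a sketch, not from the original answer; the URL is a placeholder), you can log the Set-Cookie headers the server actually returns and compare runs with the default and the custom User-Agent:
import scrapy

class CookieCheckSpider(scrapy.Spider):
    name = "cookie_check"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Logs any Set-Cookie headers the server returned for this response
        self.logger.info(response.headers.getlist("Set-Cookie"))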

Scrapy crawl blocked with 403/503

I'm running Scrapy 0.24.4, and have encountered quite a few sites that shut down the crawl very quickly, typically within 5 requests. The sites return 403 or 503 for every request, and Scrapy gives up. I'm running through a pool of 100 proxies, with the RotateUserAgentMiddleware enabled.
Does anybody know how a site could identify Scrapy that quickly, even with the proxies and user agents changing? Scrapy doesn't add anything to the request headers that gives it away, does it?
Some sites incorporate JavaScript code that needs to be run.
Scrapy doesn't execute JavaScript code, so the web app very quickly knows it's a bot.
http://scraping.pro/javascript-protected-content-scrape/
Try using Selenium for the sites that return 403. If crawling with Selenium works, you can assume the problem is the JavaScript.
I think crunchbase.com uses this kind of protection against scraping.
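A minimal way to test this with Selenium might look like the following (the URL is a placeholder for one of the pages that returns 403 to Scrapy):
from selenium import webdriver

# Fetch one of the URLs that returns 403 for Scrapy and see whether a real
# browser gets the page; if it does, the block is likely JavaScript-based.
driver = webdriver.Chrome()
driver.get("https://example.com/blocked-page")  # placeholder URL
print(driver.title)
print(len(driver.page_source))
driver.quit()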
It appears that the primary problem was not having cookies enabled. Having enabled cookies, I'm having more success now. Thanks.
For me, cookies were already enabled.
What fixed it was using another, more common user agent.
Replace USER_AGENT in your project's settings.py file with this:
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
I simply set AUTOTHROTTLE_ENABLED to True and my script was able to run.
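For reference, a minimal settings.py sketch for this (the delay values below are just example numbers):
# settings.py
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5           # initial download delay in seconds
AUTOTHROTTLE_MAX_DELAY = 60            # maximum delay under high latency
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0  # average concurrent requests per remote server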
