Godaddy website cannot be scraped with Selenium Python

I attempted to scrape the GoDaddy website but was unsuccessful.
When I search for a particular name, the page shows "We were unable to complete your search. Please try again.", while a normal browser (without Selenium) handles the same search just fine.
Sorry if I've inconvenienced you with this question; this is my first time asking one.
Url to scrape : https://in.godaddy.com/domainsearch/find?checkAvail=1&domainToCheck=bjmtuc.club
from selenium import webdriver
from selenium.webdriver.chrome.service import Service  # needed for the Service object used below
chrome_options = webdriver.ChromeOptions()
chrome_options.set_capability("goog:loggingPrefs", {"performance": "ALL", "browser": "ALL"})
chrome_options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36")
driver_path = 'drive_path'  # path to the chromedriver executable
service = Service(driver_path)  # service path set
driver = webdriver.Chrome(service=service, options=chrome_options)  # working fine
url = 'https://in.godaddy.com/domainsearch/find?checkAvail=1&domainToCheck=bjmtuc.club'
driver.get(url)
Edit 1:
When I created a fake account and tried to log in, it said "Try disabling ad blockers and other extensions, enabling javascript, or using a different web browser."
I don't use any extensions or ad blockers, so I tried explicitly enabling JavaScript right away. It said the same thing and did not work.
Code added:
chrome_options.add_argument("javascript.enabled")

It looks like the search doesn't work for unauthenticated users. Try opening your URL manually in incognito mode (or just log out).
If you open this link as an authenticated user, you get the search result.
But if you log out and then try to open this link, you receive "We were unable to complete your search. Please try again."
So when you open this link manually, it shows the result because you are authenticated on the site. Even if you close and reopen your browser and go to this link again, the search result is still shown, because you are still authenticated: the session is saved in the browser profile stored on your computer for your user only.
But when you try to open this link using Selenium, it launches a browser with its own (Selenium's) saved sessions. You never logged in to the site with Selenium, so there are no authenticated sessions. Furthermore, each time Selenium opens a browser, it creates new sessions.
So, possible solutions are:
Add login steps before opening the link
Configure Selenium to use your saved sessions (a sketch follows below)
Write separate code that logs in to the site (or open the site with Selenium and log in manually), and create a place where Selenium's sessions will be stored. Then configure all your tests to reuse the sessions saved in that place.
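A minimal Python sketch of the second option, assuming Chrome on Windows and that the profile you point at is already logged in to GoDaddy; the paths below are examples, not known values, and Chrome must be fully closed first so the profile isn't locked:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
chrome_options = webdriver.ChromeOptions()
# reuse an existing, already-authenticated Chrome profile (example paths)
chrome_options.add_argument(r"--user-data-dir=C:\Users\Me\AppData\Local\Google\Chrome\User Data")
chrome_options.add_argument("--profile-directory=Default")
service = Service('drive_path')  # same chromedriver path placeholder as in the question
driver = webdriver.Chrome(service=service, options=chrome_options)
driver.get('https://in.godaddy.com/domainsearch/find?checkAvail=1&domainToCheck=bjmtuc.club')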

Related

Selenium profile import -chrome [duplicate]

I am attempting to load a Chrome browser with Selenium using my existing account and settings from my profile.
I can get this working by using ChromeOptions to set the user-data-dir and profile-directory. This loads the browser with my profile as I want, but the browser then hangs for 60 seconds and times out without advancing through any more of the automation.
If I don't use the user-data-dir and profile settings, it works fine but doesn't use my profile.
The reading I've done points to not being able to have more than one browser open at a time with the same profile, so I made sure nothing was open while I ran the program. It still hangs for 60 seconds even without another browser open.
m_Options = new ChromeOptions();
m_Options.AddArgument("--user-data-dir=C:/Users/Me/AppData/Local/Google/Chrome/User Data");
m_Options.AddArgument("--profile-directory=Default");
m_Options.AddArgument("--disable-extensions");
m_Driver = new ChromeDriver(@"pathtoexe", m_Options);
m_Driver.Navigate().GoToUrl("somesite");
It always hangs on the GoToUrl. I'm not sure what else to try.
As per your code trials, you were trying to load the Default Chrome profile, which goes against best practice, as the Default Chrome profile may contain any of the following:
Extensions
Bookmarks
Browsing History
etc.
So the Default Chrome profile may not be in compliance with your Test Specification and may raise an exception while loading. Hence you should always use a customized Chrome profile, as below.
To create and open a new Chrome profile, follow these steps:
Open Chrome browser, click on the Side Menu and click on Settings on which the url chrome://settings/ opens up.
In People section, click on Manage other people on which a popup comes up.
Click on ADD PERSON, provide the person name, select an icon, keep the item Create a desktop shortcut for this user checked and click on ADD button.
Your new profile gets created.
Snapshot of a new profile SeLeNiUm
Now a desktop icon will be created as SeLeNiUm - Chrome
From the properties of the desktop icon SeLeNiUm - Chrome get the name of the profile directory. e.g. --profile-directory="Profile 2"
Get the absolute path of the profile-directory in your system as follows :
C:\\Users\\Thranor\\AppData\\Local\\Google\\Chrome\\User Data\\Profile 2
Now pass the value of the profile directory to an instance of ChromeOptions with the AddArgument method, along with the user-data-dir key, as follows:
m_Options = new ChromeOptions();
m_Options.AddArgument("--user-data-dir=C:/Users/Me/AppData/Local/Google/Chrome/User Data/Profile 2");
m_Options.AddArgument("--disable-extensions");
m_Driver = new ChromeDriver(@"pathtoexe", m_Options);
m_Driver.Navigate().GoToUrl("somesite");
Execute your Test
Observe Chrome gets initialized with the Chrome Profile as SeLeNiUm
If you want to run Chrome using your default profile (because you need an extension), you need to run your script using another browser, like Microsoft Edge or Microsoft IE, and your code will launch a Chrome instance.
My Code in PHP:
namespace Facebook\WebDriver;
use Facebook\WebDriver\Remote\DesiredCapabilities;
use Facebook\WebDriver\Remote\RemoteWebDriver;
use Facebook\WebDriver\Chrome\ChromeOptions;
require_once('vendor/autoload.php');
$host = 'http://localhost:4444/';
$options = new ChromeOptions();
$options->addArguments(array(
'--user-data-dir=C:\Users\paulo\AppData\Local\Google\Chrome\User Data',
'--profile-directory=Default',
'--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
));
$caps = DesiredCapabilities::chrome();
$caps->setCapability(ChromeOptions::CAPABILITY, $options);
$caps->setPlatform("Windows");
$driver = RemoteWebDriver::create($host, $caps);
$driver->manage()->window()->maximize();
$driver->get('https://www.google.com/');
// your code goes here.
$driver->quit();
Hi guys, in my environment with Chrome 63 and Selenium for control, I ran into the same problem (a 60-second wait when opening a web page).
To fix it, I found a way: set a default web page in Chrome's ./[user-data-dir]/[Profile]/Preferences file. This is the JSON data that needs to be inserted into the "Preferences" file to obtain the result:
...
"session":{
"restore_on_startup":4,
"startup_urls":[
"http://localhost/test1"
]
}
...
For set "Preferences" from selenium i have use this sample code
ChromeOptions chromeOptions = new ChromeOptions();
//set my user data dir
chromeOptions.addArguments("--user-data-dir=/usr/chromeDataDir/");
//build the data structure for the JSON to insert into the "Preferences" file
Map<String, Object> prefs = new HashMap<String, Object>();
prefs.put("session.restore_on_startup", 4);
List<String> urlList = new ArrayList<String>();
urlList.add("http://localhost/test1");
prefs.put("session.startup_urls", urlList);
//set in chromeOptions data structure
chromeOptions.setExperimentalOption("prefs", prefs);
//start chrome
ChromeDriver chromeDriver = new ChromeDriver(chromeOptions);
//this get command opens the web page; the response is now instant
chromeDriver.get("http://localhost/test2");
I found this information here: https://chromedriver.chromium.org/capabilities
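For reference, a hedged Python equivalent of the same prefs trick (the data-dir path and URLs are just the example values from the answer above):
from selenium import webdriver
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--user-data-dir=/usr/chromeDataDir/")
# same JSON as in the "Preferences" file, expressed as experimental prefs
prefs = {"session.restore_on_startup": 4,
         "session.startup_urls": ["http://localhost/test1"]}
chrome_options.add_experimental_option("prefs", prefs)
driver = webdriver.Chrome(options=chrome_options)
driver.get("http://localhost/test2")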

Wait for XHR Request and/or a Console message for Selenium in Python

I'm trying to make Selenium capture the page source after it has fully rendered. If I go to the page and capture straight away, only part of the page has rendered; if I put in a sleep of 30 seconds, it renders fully, but I want something more efficient.
If we use https://twitter.com/i/notifications as an example, you'll see that about 5 seconds after the page loads there are a toast_poll and a timeline XHR request.
I want to be able to detect one of these requests and wait until it fires; that would be the indicator that the page has fully loaded.
The site that I am using fires console.log("Done"), so if I could detect console messages in PhantomJS and Firefox, that would be an even better option than waiting for an XHR request: just wait until Done appears in the console, and that is the indicator that the page has fully loaded.
Regarding the Duplicate Flagging of this Post:
This question is about PhantomJS and Firefox. The post Detect javascript console output with python is from over a year ago, and the answer given only works on Chrome. I am looking for a PhantomJS and Firefox option, which, based on StackOverflow, I already suspect isn't possible; that's why my post starts with waiting for an XHR request.
I've already tried the following code, but it doesn't work for me: I get zero output even though the website fires console.log("Done").
from seleniumrequests import PhantomJS
from seleniumrequests import Firefox
from selenium import webdriver
import os
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
webdriver.DesiredCapabilities.PHANTOMJS['loggingPrefs'] = { 'browser':'ALL' }
browser = PhantomJS(executable_path="phantomjs.exe", service_log_path=os.path.devnull)
browser = webdriver.Firefox()
browser.set_window_size(1400, 1000)
url = "https://website.com"
browser.get(url)
for entry in browser.get_log('browser'):
print entry
I'm unable to test with browser = webdriver.Firefox() commented out because I am not sure how to have two lots of DesiredCapabilities set.
You could override the console.log function and wait for the "Done" message with execute_async_script:
from selenium import webdriver
driver = webdriver.Firefox()
driver.set_script_timeout(10)
driver.get("...")
# wait for console.log("Done") to be called
driver.execute_async_script("""
var callback = arguments[0];
console.log = function(message) {
if(message === "Done")
callback();
};
""")

How to handle javascript content and redirects after successful weblogin SSO authentication?

I am writing a Python script that downloads class content (mp4, pdf) from my school website. My school uses Weblogin SSO authentication to access any of its protected URLs.
I was able to authenticate my credentials using the first part of the script below:
import requests

#1. Authenticate
login_url = "https://weblogin.MY_SCHOOL.edu/login"
payload = {'login': 'my_loging', 'password': 'my_pass'}
target_url = "https://My_SCHOOL.instructure.com/courses/12345678"
with requests.Session() as c:
    req_headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36'}
    c.headers.update(req_headers)
    c.get(login_url)  # to get cookies
    c.post(login_url, data=payload)
    #2. get html from target site
    W1 = c.get(target_url)
    print(W1.url)
    print(W1.text)
    #3. parse html and download content.
    #tbc
I can see in the c.post(...) response text that my authentication was successful, but whenever I try to access any of the target sites using get() in the same requests.Session(), I don't get the expected HTML content for my class, but rather a message that reads:
"Since your browser does not support JavaScript, you must press the
Continue button once to proceed"
And the target URL redirects to this url:
"https://idp.MY_SCHOOL.edu/idp/profile/SAML2/Redirect/SSO"
Why am I not able to access the target URL(s) after a successful SSO authentication? I am not sure whether the lack of JavaScript support in the requests module is the issue here, because even when I disable JS support in my internet browser, I am able to see some of the HTML content of the target_url, albeit not all of it. It seems strange that my get() request gets stuck at the redirected URL: "https:.../SAML2/Redirect/SSO"
I would appreciate any pointers on how to get around this issue. I would rather not use webdrivers such as Selenium or mechanize. I have used QtWebKit to render JavaScript content, but I don't know whether it is even possible to transfer my authentication cookies from my requests.Session() to QtWebKit.
Any help is much appreciated. Thanks
I'm not an expert in SSO, but I think I know what's going on. In a typical setup, your browser posts your login credentials to the login URL. The response is an HTML page containing a form, and that form contains your SSO token. Embedded JavaScript in the page then submits the form to the application you are trying to access; the application validates the token and grants you access. When JavaScript is enabled, this happens seamlessly. If you turn off JavaScript in your browser and try to log in, you will see the same message, and you will have to press a button to submit the form containing your token. To do this via script, you will likely have to parse the form, extract the token value, and then post it yourself.
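As a rough sketch of that last step, assuming the IdP page contains a single form whose hidden inputs carry the token (the exact field names vary by IdP, so verify them against the real page), you could continue the question's requests session like this:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
# continuing inside the question's `with requests.Session() as c:` block
resp = c.get(target_url)                        # lands on the "press Continue" page
soup = BeautifulSoup(resp.text, "html.parser")
form = soup.find("form")
fields = {i["name"]: i.get("value", "")         # hidden inputs carry the SSO token
          for i in form.find_all("input") if i.get("name")}
final = c.post(urljoin(resp.url, form["action"]), data=fields)
print(final.url)                                # should now be the course page, not .../SAML2/Redirect/SSO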

Python clicking a button on webpage (back end)

I am trying to use Python to simulate logging in to the company email.
The script works fine, and the resulting page shows that I am already logged in.
import urllib
import urllib2
import mechanize
import cookielib
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)')]
url = 'http://www.company.com/'
form_data = {'name_file': 'THENAME', 'password_field': 'THEPASSWORD'}
params = urllib.urlencode(form_data)
response = br.open(url, params)
However, I need to click the "GoTo Email" button on the webpage to enter the email dashboard. Note that the web address doesn't change, and there is no redirect to any other page, before or after clicking the button.
The HTML is shown below. I think it's a button.
<input id="mail1_btnGoto" class="btn" onmouseover="this.className='btnOver'" onmouseout="this.className='btn'" name="mail1$btnGoto" value="GoTo Email" type="submit">
I thought about using winapi to simulate a mouse click, but that's silly because it only controls the mouse at the front end. Selenium isn't a solution in this case because I want the script to run at the back end.
How can I have the button ‘click’ on the webpage?
It seems the email dashboard is driven by JavaScript, so you cannot simply use winapi to simulate a mouse click without evaluating the script.
Generally there are two workarounds:
Use a full-featured browser driver. As you mentioned, Selenium is a good choice across many programming languages. The webdriver does not require opening browsers manually and can be fully controlled by scripts. You can also try GhostDriver instead: it uses PhantomJS and can run on a backend server (but installing PhantomJS is required).
Mock the request. Logging in will usually trigger an HTTP/HTTPS request, and you can use Python to mock that request. Use HTTP debugging tools like Fiddler, Wireshark, or the Chrome web inspector to capture the information the browser sends to the authentication server (a rough sketch follows below).
I've tried to be specific and detailed, but given the diversity of web crawling, a step-by-step guide is beyond my reach.
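A rough illustration of the "mock the request" idea using the question's own mechanize setup; selecting the form by index is an assumption, so inspect the page to pick the right one:
# continuing from the question's code, after response = br.open(url, params)
br.select_form(nr=0)                        # assume the dashboard form is the first form on the page
result = br.submit(name="mail1$btnGoto")    # "click" the GoTo Email submit button by its name
print result.read()                         # the email dashboard HTML, if the session is valid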

using Selenium, Firefox, Python to save download of EPS files to disk after automated clicking of download link

Tools: Ubuntu, Python, Selenium, Firefox
I am trying to automate the downloading of image files from a subscription web site. I do not have access to the server other than through my paid subscription. To avoid having to click a button for each file download, I decided to automate it using Python, Selenium, and Firefox. (I have been using these three together for the first time for two days now. I also know very little about cookies.)
I am interested in downloading following three formats in order or preference: ['EPS', 'PNG', 'JPG']. A button for each format is available on the web site.
I have managed to have success in automating the downloading of the 'PNG' and 'JPG' files to disk by setting the Firefox preferences by hand as suggested in this post: python webcrawler downloading files
However, when the file is in an 'EPS' format, the "You have chosen to save" dialog box still pops open in the Firefox window.
As you can see from my code, I have set the preferences to save 'EPS' files to disk. (Again, 'JPG' and 'PNG' files are saved as expected.)
from selenium import webdriver
profile = webdriver.firefox.firefox_profile.FirefoxProfile()
profile.set_preference("browser.download.folderList", 1)
profile.set_preference("browser.download.manager.showWhenStarting", False)
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
'image/jpeg,image/png,application/postscript,'
'application/eps,application/x-eps,image/x-eps,'
'image/eps')
profile.set_preference("browser.helperApps.alwaysAsk.force", False)
profile.set_preference("plugin.disable_full_page_plugin_for_types",
"application/eps,application/x-eps,image/x-eps,"
"image/eps")
profile.set_preference(
"general.useragent.override",
"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0)"
" Gecko/20100101 Firefox/26.0")
driver = webdriver.Firefox(firefox_profile=profile)
#I then log in and begin automated clicking to download files. 'JPG' and 'PNG' files are
#saved to disk as expected. The 'EPS' files present a save dialog box in Firefox.
I tried installing a Firefox extension called "download-statusbar" that claims to prevent any save dialog box from appearing. The extension loads in the Selenium Firefox browser, but it doesn't function. (A lot of reviews say the extension is broken, despite the developers' insistence that it works.) It isn't working for me anyway, so I gave up on it.
I added this to the Firefox profile in that attempt:
#The extension loads, but it doesn't function.
download_statusbar = '/home/$USER/Downloads/'
'/download_statusbar_fixed-1.2.00-fx.xpi'
profile.add_extension(download_statusbar)
From reading other stackoverflow.com posts, I decided to see if I could download the file via the url with urllib2. As I understand how this would work, I would need to add cookies to the headers in order to authenticate the downloading of the 'EPS' file via a url.
I am unfamiliar with this technique, but here is the code I tried to use to download the file directly. It failed with a '403 Forbidden' response despite my attempts to set cookies in the urllib2 opener.
import urllib2
import cookielib
import logging
import sys
cookie_jar = cookielib.LWPCookieJar()
handlers = [
urllib2.HTTPHandler(),
urllib2.HTTPSHandler(),
]
[h.set_http_debuglevel(1) for h in handlers]
handlers.append(urllib2.HTTPCookieProcessor(cookie_jar))
#using selenium driver cookies, returns a list of dictionaries
cookies = driver.get_cookies()
opener = urllib2.build_opener(*handlers)
opener.addheaders = [(
'User-agent',
'Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:26.0) '
'Gecko/20100101 Firefox/26.0'
)]
logger = logging.getLogger("cookielib")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.DEBUG)
for item in cookies:
opener.addheaders.append(('Cookie', '{}={}'.format(
item['name'], item['value']
)))
logger.info('{}={}'.format(item['name'], item['value']))
response = opener.open('http://path/to/file.eps')
#Fails with a 403 Forbidden response
Any thoughts or suggestions? Am I missing something easy or do I need to give up hope on an automated download of the EPS files? Thanks in advance.
Thank you to @unutbu for helping me solve this. I just didn't understand the anatomy of a file download; I understand it a little better now.
I ended up installing an extension called "Live HTTP Headers" on Firefox to examine the headers sent by the server. As it turned out, the 'EPS' files were sent with a 'Content-Type' of 'application/octet-stream'.
Now the EPS files are saved to disk as expected. I modified the Firefox preferences to the following:
profile.set_preference('browser.helperApps.neverAsk.saveToDisk',
'image/jpeg,image/png,'
'application/octet-stream')
