Firefox, Selenium, Webdriver: how to programmatically erase all cookies and site data? - python

I have the following setup:
Python 3.7
Selenium 3.141.0
Firefox 67.0.4
Geckodriver 0.24.0
I've written a program that scrapes hotel data off of a hotel operator website. Using the link as a query, the program supplies the site with city, check-in and check-out dates, number of persons, etc.
The program starts up Firefox and makes the first query, and it all goes well. The problem appears from the second query onward: no matter what is supplied in the following links, it keeps showing the results for the city in the first query, only changing the dates.
When the webdriver is restarted, the first query works again, but from the second one on the same thing happens.
I tried using delete_all_cookies() and configuring the webdriver profile to not create any cache, but it does not work. I also tried using Python to delete all files inside the profile folder, and it still does not work.
The strange thing is that if I go into the browser and manually delete "Cookies and other site data" it works, but I can't find a way of doing this programmatically. I tried it in both Firefox and Chrome.
Restarting the webdriver also works; I understand that it discards the profile and starts with a fresh one every time. But this is too costly time-wise.
from selenium import webdriver

DRIVER = webdriver.Firefox()

# First link: it all goes OK
URL = 'https://www.wyndhamhotels.com/en-us/hotels/beijing-china?brand_id=ALL&checkInDate=8/10/2019&checkOutDate=8/11/2019&useWRPoints=false&children=0&adults=2&rooms=1'
DRIVER.get(URL)
# From the second link on, no matter how many searches I do, I always get the results for Beijing
URL = 'https://www.wyndhamhotels.com/en-us/hotels/bremen-germany?brand_id=ALL&checkInDate=9/11/2019&checkOutDate=9/11/2019&useWRPoints=false&children=0&adults=2&rooms=1'
DRIVER.get(URL)
URL = 'https://www.wyndhamhotels.com/en-us/hotels/paris-france?brand_id=ALL&checkInDate=9/11/2020&checkOutDate=9/11/2020&useWRPoints=false&children=0&adults=2&rooms=1'
DRIVER.get(URL)
Is there any way to programmatically delete all cookies and other site data the way that this happens when you do it manually from the menu, while the webdriver is running?
Or, on another train of thought, what exactly happens when you manually delete cookies and other site data from the browser menu? What gets deleted, and from where?
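For reference, here is a sketch of the closest I've come short of restarting: on top of delete_all_cookies(), clear the current origin's web storage with execute_script. This is an assumption about where the site keeps its search state, and it does not touch IndexedDB or the HTTP cache:
def clear_site_data(driver):
    # Remove cookies for the current domain.
    driver.delete_all_cookies()
    # Clear localStorage and sessionStorage for the current origin only;
    # IndexedDB and the HTTP cache are NOT cleared by this.
    driver.execute_script("window.localStorage.clear();")
    driver.execute_script("window.sessionStorage.clear();")

clear_site_data(DRIVER)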

Related

Selenium doesn't keep cache valid

I'm working on Python software with Selenium. The problem is I want my script to save cookies after logging in. I save cookies using both the "pickle" module and the argument below:
opts.add_argument("user-data-dir=cachedD")
But when I quit the browser and then launch it again and go to the same URL it left off at, the website again redirects to the login page. The website uses "moodle", and I guess its cookies expire after quitting the browser. How can I save cookies and continue where it left off? I should say that there's at most a 15-second gap between two launches.
You're potentially not using the flag correctly.
With this flag you specify a folder path. If you review this page:
--user-data-dir
Directory where the browser stores the user profile.
That link may not look right, but the Chromium page says that's the right list.
Historically, I've had success with:
.add_argument(r"user-data-dir=C:\Temp")
If that is still not working as you expect, there are a few other things you can look at.
Review this page - cookies can be deleted when you close your browser. You'll want to verify the value of this option.
Another check is to open your chromedriver via Selenium and go to chrome://version/. From there you can review what you're running, and you'll see there are a lot more flags enabled by default. You should check that these match up with how you want your browser to behave.
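To make the pickle round-trip from the question concrete, here is a minimal sketch; the URL is hypothetical (substitute your moodle site), and note that add_cookie only works after you have navigated to the cookie's domain:
import pickle
from selenium import webdriver

opts = webdriver.ChromeOptions()
opts.add_argument(r"user-data-dir=C:\Temp\selenium-profile")  # hypothetical path; persists the whole profile
driver = webdriver.Chrome(options=opts)

# First run: log in, then save the session cookies to disk.
driver.get("https://your-moodle-site.example/login")  # hypothetical URL
# ... perform the login here ...
with open("cookies.pkl", "wb") as f:
    pickle.dump(driver.get_cookies(), f)

# Next run (within your 15-second window): visit the domain first,
# re-add the saved cookies, then reload so they take effect.
driver.get("https://your-moodle-site.example")
with open("cookies.pkl", "rb") as f:
    for cookie in pickle.load(f):
        driver.add_cookie(cookie)
driver.refresh()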

Selenium + Webdriver return same information as first get after repeated gets

Repeated calls to the same URL using Python Selenium and a webdriver (geckodriver or chromedriver) return the proper information the first time the program is run, but each successive run, even after a wait of 60 seconds and even after a reboot, still returns the information from the first get.
Below is the code for the start of a program to scrape the odds at a racetrack every minute. The sleep in the code is shorter for testing purposes.
import time
import os
from selenium import webdriver
#url = "https://www.drf.com/live_odds/winodds/track/DED/USA/3/D"
#url = "https://www.drf.com/live_odds/winodds/track/TAM/USA/10/D"
#url = "https://www.drf.com/live_odds/winodds/track/AUS-MNG/AUS/5/D"
#url = "https://www.drf.com/live_odds/winodds/track/AUS-AUC/AUS/2/D"
url = "https://www.drf.com/live_odds/winodds/track/SA/USA/5/D"
driver = webdriver.Chrome()
driver.get(url)
driver.refresh()
time.sleep(50)
#url = "https://www.drf.com/live_odds/winodds/track/AQU/USA/7/D"
#url = "https://www.drf.com/live_odds/winodds/track/LA/USA/4/D"
driver.close()
driver.quit()
# os.system('killall Chrome')
At first I thought the problem was in my requests, so I switched to Selenium with geckodriver and later chromedriver. Then it worked: the first time I got the URL, the proper information was returned. The second time I used the same URL and did a get (the odds will eventually change), I still got the results from the first get. Even when I run the program again, I still get the same results as the first get. But if I run Chrome without Selenium and go to the same URL, I get the proper updated odds. Also, running Chrome without Selenium the odds are displayed horizontally across the page, while they are displayed in a column when I run Selenium with chromedriver. I know there are often compatibility issues, but I downloaded Selenium and the drivers within the last two or three weeks.
This program will be far from accurate if you run it when SA (Santa Anita Racetrack) is not running; in this code it would be race number 5. You can easily change the race number to match the current race, and you can change the track by going to www.drf.com, then going to entries and clicking on Live Odds. There you will see a list of tracks; click on one and you will see the appropriate URL. Paste that into the program and assign it as the new URL. Again, the first screen returned is correct, but you can run the program again and again and you will only get the results obtained on the first screen, not the new odds you get if you run Chrome without Selenium. Is there some reference stuck in a buffer, or is the website trying to prohibit continuous scraping of the odds? I also tried rebooting but still got the old results when running Selenium and chromedriver again.
Again, if I run just Chrome I get the new updates. Could this mean some reference to the original request is saved on disk, since all references in memory would be erased by a reboot? Could this involve a socket reference?
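One cheap way to rule out caching: make every request unique so no cache keyed on the exact URL can serve the first response. A minimal sketch of this cache-busting idea, under the assumption that drf.com simply ignores unknown query parameters:
import time
from selenium import webdriver

driver = webdriver.Chrome()
url = "https://www.drf.com/live_odds/winodds/track/SA/USA/5/D"

for _ in range(3):
    # Append a throwaway timestamp parameter so every request differs.
    driver.get(url + "?_=" + str(int(time.time())))
    time.sleep(60)

driver.quit()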

Python - Automating form entry on a .aspx website and storing output in a file (using Selenium?)

I just started to learn coding this month, beginning with Python. I would like to automate a simple task (my first project): visit a company's career website, retrieve all the jobs posted for the day, and store them in a file. This is what I would like to do, in sequence:
Go to http://www.nov.com/careers/jobsearch.aspx
Select the option - 25 Jobs per page
Select the date option - Today
Click on Search for Jobs
Store results in a file (just the job titles)
I looked around and found that Selenium is the best way to go about handling .aspx pages.
I have done steps 1-4 using Selenium. However, there are two issues:
I do not want the browser opening up. I just need the output saved to a file.
Even if I am OK with the browser popping up, running the Python code (exported from Selenium as WebDriver) in IDLE (I have Windows) results in errors. When I run the Python code, the browser opens and the link loads, but none of the form selections happen and I get the following error message (link below) before the browser closes. What does the error message mean?
http://i.stack.imgur.com/lmcDz.png
Any help/guidance will be appreciated...Thanks!
First, about the error you got: judging by the NoSuchElementException and the message Unable to locate element, the selector you provided is wrong and the webdriver can't find the element.
Well, since you did not post your code and I can't open the link to the website you entered, I can only give you a sample and include as much detail as I can.
from selenium import webdriver
driver = webdriver.Firefox()
driver.get("url")
number_option = driver.find_element_by_id("id_for_25_option_indicator")
number_option.click()
date_option = driver.find_element_by_id("id_for_today_option_indicator")
date_option.click()
search_button = driver.find_element_by_id("id_for_search_button")
search_button.click()
all_results = driver.find_elements_by_xpath("some_xpath_that_is_common_between_all_job_results")
result_file = open("result_file.txt", "w")
for result in all_results:
    result_file.write(result.text + "\n")
driver.close()
result_file.close()
Since you said you just started learning to code recently, I should give some explanations:
I recommend using driver.find_element_by_id in all cases where elements have an ID property; it's more robust.
Instead of result.text, you can use result.get_attribute("value") or result.get_attribute("innerHTML").
That's all that comes to mind for now, but it would be better if you posted your code so we can see what's wrong with it. Additionally, it would be great if you gave me a working link to the website, so I can add more detail to the code; your current link is broken.
Concerning the first issue, you can simply use a headless browser. This is possible with Chrome as well as Firefox.
Check Grey Li's answer here for example: Python - Firefox Headless
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
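Putting the two answers together, a rough end-to-end sketch (headless browser, then the same placeholder element IDs as in the first answer):
from selenium import webdriver

options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get("http://www.nov.com/careers/jobsearch.aspx")
# ... same find_element_by_id / click / file-writing steps as in the sketch above ...
driver.quit()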

Selenium Webdriver for Python: get page, enter values, click submit, get source

Alright, I'm confused. So I want to scrape a page using Selenium Webdriver and Python. I've recorded a test case in the Selenium IDE. It has stuff like
Command Target
click link=14
But I don't see how to run that in Python. The desirable end result is that I have the source of the final page.
Is there a run_test_case command? Or do I have to write individual command lines? I'm rather missing the link between the test case and the actual automation. Every site tells me how to load the initial page and how to get stuff from that page, but how do I enter values and click on stuff and get the source?
I've seen:
submitButton=driver.find_element_by_xpath("....")
submitButton.click()
Ok. And enter values? And get the source once I've submitted a page? I'm sorry that this is so general, but I really have looked around and haven't found a good tutorial that actually shows me how to do what I thought was the whole point of Selenium Webdriver.
I've never used the IDE. I just write my tests or site automation by hand.
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("http://www.google.com")
print(browser.page_source)
You could put that in a script and just run python wd_script.py, or you could open a Python shell and type it in by hand, watching the browser open and get driven by each line. For this to work you will obviously need Firefox installed as well. Not all versions of Firefox work with all versions of Selenium, but the latest versions of each at the time of writing (Firefox 19, Selenium 2.31) do.
An example showing logging into a form might look like this:
username_field = browser.find_element_by_css_selector("input[type=text]")
username_field.send_keys("my_username")
password_field = browser.find_element_by_css_selector("input[type=password]")
password_field.send_keys("sekretz")
browser.find_element_by_css_selector("input[type=submit]").click()
print(browser.page_source)
This kind of stuff is much easier to write if you know CSS well. Weird errors can be caused by trying to find elements that are generated in JavaScript; you might be looking for them before they exist, for instance. It's easy enough to tell if this is the case by putting in a time.sleep for a little while and seeing if that fixes the problem. More elegantly, you can abstract a general wait-for-element function, as sketched below.
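A minimal sketch of such a helper, using the WebDriverWait and expected_conditions utilities that ship with Selenium:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_element(browser, css_selector, timeout=10):
    # Poll until the element is present in the DOM, or raise TimeoutException.
    return WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
    )

wait_for_element(browser, "input[type=submit]").click()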
If you want to run Webdriver sessions as part of a suite of integration tests then I would suggest using Python's unittest to create them. You drive the browser to the site under test, and make assertions that the actions you are taking leave the page in a state you expect. I can share some examples of how that might work as well if you are interested.

Selenium with Python, how do I get the page output after running a script?

I'm not sure how to find this information; I have found a few tutorials so far about using Python with Selenium, but none have so much as touched on this. I am able to run some basic test scripts through Python that automate Selenium, but it just shows the browser window for a few seconds and then closes it. I need to get the browser output into a string/variable (ideally), or at least save it to a file, so that Python can do other things with it (parse it, etc.). I would appreciate it if anyone could point me towards resources on how to do this. Thanks.
Using Selenium WebDriver and Python, you simply access the .page_source property to get the source of the current page.
For example, using the Firefox() driver:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.example.com/')
print(driver.page_source)
driver.quit()
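Since the question also asks about saving the output to a file, a short follow-on sketch; the write has to happen before driver.quit(), and the filename is arbitrary:
# ... after driver.get(...), but before driver.quit():
with open("page.html", "w", encoding="utf-8") as f:
    f.write(driver.page_source)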
There's a Selenium.getHtmlSource() method in Java; most likely it is also available in Python. It returns the source of the current page as a string, so you can do whatever you want with it.
OK, so here is how I ended up doing this, for anyone who needs it in the future.
You have to use Firefox for this to work.
1) Create a new Firefox profile (not necessary, but ideal so as to separate this from normal Firefox usage). There is plenty of info on Google about how to do this; it depends on your OS.
2) Get the Firefox plugin: https://addons.mozilla.org/en-US/firefox/addon/2704/ (this automatically saves all pages for a given domain name). You need to configure it to save whichever domains you intend to auto-save.
3) Then just start the Selenium server with the profile you created (below is an example for Linux):
cd /root/Downloads/selenium-remote-control-1.0.3/selenium-server-1.0.3
java -jar selenium-server.jar -firefoxProfileTemplate /path_to_your_firefox_profile/
That's it. It will now save all the pages for a given domain whenever Selenium visits them. Selenium does create a bunch of garbage pages too, so you may need to delete those via some simple regex parsing; it's up to you from there how to manipulate the saved pages.
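A tiny sketch of that cleanup step; the output folder name and filename pattern are assumptions for illustration:
import os
import re

keep = re.compile(r"jobsearch.*\.html$")  # hypothetical pattern for pages worth keeping
for name in os.listdir("saved_pages"):    # hypothetical folder the plugin saves into
    if not keep.match(name):
        os.remove(os.path.join("saved_pages", name))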
