Can't get cookies using selenium - python

I am using Selenium with Python. I create a webdriver using the Firefox binary and profile from the Tor Browser, load a webpage, and navigate around, but when I try to use
cookies = driver.get_cookies()
it returns an empty list. If I check the webpage with the dev tools under Storage -> Cookies, the cookies are set, but if I type
document.cookie
in the console, it returns an empty string.
Is this a Tor Browser restriction that blocks this kind of JavaScript, or am I missing something?
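For reference, a minimal sketch of the setup described above, with placeholder Tor Browser paths. Two details worth knowing: get_cookies() only returns cookies for the domain of the page currently loaded, and HttpOnly cookies show up in the dev tools but never in document.cookie.

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary

# Placeholder paths -- adjust to your Tor Browser install
binary = FirefoxBinary('/path/to/tor-browser/Browser/firefox')
profile = webdriver.FirefoxProfile(
    '/path/to/tor-browser/Browser/TorBrowser/Data/Browser/profile.default')
driver = webdriver.Firefox(firefox_profile=profile, firefox_binary=binary)

driver.get('https://example.com')
# Only cookies for the currently loaded domain are returned
print(driver.get_cookies())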

Related

How to automatically accept cookies for a website with Selenium Python

I am using Selenium to automate tests on many websites, and every time I get a cookie-wall popup.
I know I can find the XPath of the accept-cookies button and then click it with Selenium. That solution is not convenient for me because I would need to locate the button manually for every site; I want a script that accepts cookies on all sites automatically.
What I tried: getting a cookie jar by requesting the website with Python requests and then setting those cookies in Selenium ==> not working, many errors.
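(For reference, the usual pattern for that handoff looks like the sketch below; one common cause of those errors is that add_cookie() only accepts a cookie after the driver has already loaded a page on the same domain.)

import requests
from selenium import webdriver

url = "https://www.google.com"  # example site
resp = requests.get(url)

driver = webdriver.Firefox()
driver.get(url)  # must be on the cookie's domain before add_cookie() works
for name, value in resp.cookies.get_dict().items():
    driver.add_cookie({"name": name, "value": value})
driver.refresh()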
I found this on Stack Overflow:
from selenium import webdriver

fp = webdriver.FirefoxProfile()
# 2 = reject all cookies, so there is nothing for a consent banner to set
fp.set_preference("network.cookie.cookieBehavior", 2)
driver = webdriver.Firefox(firefox_profile=fp, executable_path="./geckodriver")
This worked for google.com (no accept-cookies popup appeared), but it failed with facebook.com and instagram.com.
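One generic, if imperfect, approach is to look for buttons whose text matches common consent labels and click the first match. A rough sketch with a hypothetical accept_cookies() helper; it will miss banners rendered inside iframes, which may be why sites such as facebook.com behave differently:

from selenium.common.exceptions import NoSuchElementException, WebDriverException

CONSENT_LABELS = ["Accept all", "Accept", "I agree", "Agree", "OK"]

def accept_cookies(driver):
    # Try each label as button or link text; stop at the first clickable match
    for label in CONSENT_LABELS:
        xpath = "//button[contains(., '%s')] | //a[contains(., '%s')]" % (label, label)
        try:
            driver.find_element_by_xpath(xpath).click()
            return True
        except (NoSuchElementException, WebDriverException):
            continue
    return False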

Get the current URL from chrome in python without using Selenium

I want to get the current URL from Chrome (or Firefox, for that matter) without using any automation tool like Selenium. I thought about reading the Chrome history database and sorting by time, but that is only possible when the Chrome window is closed; I want the current URL while Chrome is open as well.
Any help is appreciated.
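One option, assuming you can launch Chrome yourself with remote debugging enabled: the DevTools HTTP endpoint lists the open tabs and their URLs without any automation framework.

import requests

# Assumes Chrome was started with: chrome --remote-debugging-port=9222
tabs = requests.get("http://localhost:9222/json").json()
for tab in tabs:
    if tab.get("type") == "page":
        print(tab["url"])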

Selenium page loads as blank unless browser is manually opened with same profile

I am using Selenium for a crawling project, but I struggle with one specific webpage (in both Chrome and Firefox).
I found two workarounds that work to an extent, but I want to know why this issue happens and how to avoid it.
1) Opening Chrome manually and then starting Selenium with my user profile.
If I manually start Chrome and then run:
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument(r"user-data-dir=C:\Users\User\AppData\Local\Google\Chrome\User Data")
driver = webdriver.Chrome(options=options)
the page loads as intended.
2) Passing a variable in the request.
By appending /?anything to the URL, the page loads as intended in Selenium (sketched below).
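A minimal sketch of that second workaround; the URL is a placeholder:

from selenium import webdriver

driver = webdriver.Chrome()
url = "https://example.com/problem-page"  # placeholder for the affected page
driver.get(url + "/?anything")  # any throwaway query string works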
For some reason the webpage has a function in the header despite not loading... I suspect this could be a clue, but I do not know enough to determine the cause.

How to search for a string inside Chrome internal pages, using Python and/or Selenium

I'm trying to write a script that would allow me to search for a string inside Chrome's internal pages (for example: "chrome://help").
Is there a simple way to do it, or does it require special tools like the Selenium WebDriver API?
I know it's possible for "normal" web pages, but what about "internal" ones?
You can use Selenium WebDriver to achieve this task easily; in this case we will extract the version number of Google Chrome.
Here is the sample code, with comments explaining every step:
from selenium import webdriver

# executable_path should point to where chromedriver lives on your computer
driver = webdriver.Chrome(executable_path='/users/user/Downloads/Chromedriver')
# Tell the driver to go to the help section of Chrome
driver.get('chrome://help')
# Certain elements are stored in another iframe, so you must switch to that
# particular iframe, which is named 'help' in this case
driver.switch_to.frame(driver.find_element_by_name('help'))
# Retrieve the element's text and store it in a variable
version_string = driver.find_element_by_id('version-container').text
# Now you can easily print it
print(version_string)
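As an aside, if all you need is the version number, Selenium also exposes it through the driver capabilities, which avoids scraping chrome:// pages entirely (the key name varies with the Selenium/Chrome version):

# 'browserVersion' on W3C-compliant drivers, 'version' on older ones
caps = driver.capabilities
print(caps.get('browserVersion') or caps.get('version'))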

How to get the content from web browser using python?

I have a webpage:
http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#
and I need to extract the table from it.
Problem encountered: I have been using BeautifulSoup and requests to get the URL content. The problem with these methods is that I get the web content before the table has been generated,
so I get an empty table:
<table>
<thead>
</thead>
<tbody>
</tbody>
</table>
My approach: now I am trying to open the URL in the browser using
webbrowser.open_new_tab(url)
and then get the content from the browser directly. This gives the server time to update the table, after which I should be able to get the content from the page.
Problem: I am not sure how to fetch information from the web browser directly.
Right now I am using Mozilla Firefox on a Windows system.
The closest thing I found is this link, but it only tells which sites are open, not their content.
Is there any other way to let the table load with urllib2, BeautifulSoup, or requests? Or is there any way to get the loaded content directly from the webpage?
Thanks
To add to Santiclause's answer: if you want to scrape JavaScript-populated data, you need something that executes the JavaScript.
For that you can use the selenium package with a webdriver such as Firefox or PhantomJS (which is headless) to connect to the page, execute the scripts, and get the data.
An example for your case:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()  # You can replace this with another webdriver
driver.get("http://kff.org/womens-health-policy/state-indicator/ultrasound-requirements/#")
# Wait until the page's JavaScript has filled the table body
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr")))
source = driver.page_source  # Here is your populated data
driver.quit()  # don't forget to quit the driver!
Of course, if you can access the JSON directly, as user Santiclause mentioned, you should do that. You can find it by checking the Network tab while inspecting the page, which takes some playing around.
The reason the table isn't being filled is that Python doesn't process the page it receives with urllib2, so there's no DOM, no JavaScript runs, et cetera.
After reading through the source, it looks like the information you're looking for can be found at http://kff.org/datacenter.json?post_id=32781 in JSON format.
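A minimal sketch of fetching that endpoint directly; the structure of the returned JSON is something you would have to inspect yourself:

import requests

resp = requests.get("http://kff.org/datacenter.json", params={"post_id": 32781})
resp.raise_for_status()
data = resp.json()  # explore this to locate the table rows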
