I'm working on scraping data using Selenium for academic research that will test how certain user behaviors across Facebook and the web affect the ads a user sees.
For this, I need a kind of fake user that will first interact with Facebook, then visit some sites with Facebook cookies (allowing Facebook to continue tracking its behavior), and then go back to Facebook.
I haven't done much web development, and I'm confused about how exactly to keep and load cookies for this scenario.
I've been trying to save and load cookies using the following code snippets:
import pickle

# saving
with open('cookies.pkl', 'wb') as cookiesfile:
    pickle.dump(driver.get_cookies(), cookiesfile)

# loading
with open('cookies.pkl', 'rb') as cookiesfile:
    cookies = pickle.load(cookiesfile)
for cookie in cookies:
    driver.add_cookie(cookie)
On Facebook, this either triggers an error popup telling me to reload or redirects me to the login page. On other sites, even ones that explicitly state they use Facebook trackers, this causes an InvalidCookieDomainException.
What am I doing wrong?
Instead of handling cookies yourself, I would recommend using ChromeOptions to persist a browser session. This also preserves local storage and other state, not just cookies.
The next time you open a browser session, the Chrome instance will load the previous "profile" and continue maintaining it.
options = webdriver.ChromeOptions()
options.add_argument('user-data-dir={}'.format(<path_to_a_folder_reserved_for_browser_data>))
driver = webdriver.Chrome(executable_path=<chromedriver_exe_path>, options=options)
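For example, a minimal end-to-end sketch under that approach (the profile folder path here is a placeholder): on the first run you log in to Facebook by hand; on later runs Chrome reopens the same profile, so the session carries over to the third-party sites as well.
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument('user-data-dir=/tmp/fb-research-profile')  # placeholder folder

driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
# First run: log in to Facebook manually, then quit.
# Later runs: the saved profile keeps you logged in, so you can move
# between facebook.com and other sites without reloading cookies yourself.
driver.get('https://www.facebook.com')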
I'm trying to access and get data from the www.cclonline.com website using a Python script.
This is the code:
import requests
from requests_html import HTML
source = requests.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/')
html = HTML(html=source.text)
print(source.status_code)
print(html.text)
This is the error I get:
403
Access denied | www.cclonline.com used Cloudflare to restrict access
Please enable cookies.
Error 1020
Ray ID: 64c0c2f1ccb5d781 • 2021-05-08 06:51:46 UTC
Access denied
What happened?
This website is using a security service to protect itself from online attacks.
How can I solve this problem? Thanks.
The site's robots.txt does not explicitly say that no bot is allowed, but you do need to make your request look like it's coming from an actual browser.
Now to the issue at hand: the response says you need to have cookies enabled. That can be solved by using browser automation such as Selenium (optionally headless). Selenium gives you everything a browser has to offer, because it drives Google Chrome or a browser of your choice. It will make the server think the request is coming from an actual browser and will return a response.
Learn more about how to use Selenium for scraping here.
Also remember to adjust your crawl rate accordingly: pause after each request and swap user-agents often.
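As a rough sketch of that approach (the user-agent string and wait time are arbitrary example values, and Cloudflare may still serve a challenge; treat this as a starting point):
import time
from selenium import webdriver

options = webdriver.ChromeOptions()
# example user-agent; rotate values between runs if needed
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)')

driver = webdriver.Chrome(options=options)
driver.get('https://www.cclonline.com/category/409/PC-Components/Graphics-Cards/')
time.sleep(5)  # pause between requests, per the advice above

html = driver.page_source  # the rendered page, with cookies handled by the browser
print(len(html))
driver.quit()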
There's no silver bullet for solving Cloudflare challenges. In my projects I've tried the solutions proposed on this website, using Playwright with different options: https://substack.thewebscraping.club/p/cloudflare-how-to-scrape
I am making a web scraper in Python that can bring back my YouTube channel stats, so I went to my YT Studio site, copied the link, fetched it, and printed the soup using bs4. I took the whole text that was printed and created an HTML file, and when I looked at it, it was the YouTube login page.
So now I want to log in (let's say I can provide the password and email ID in a text file) in order to scrape the YT Studio stats. I have no idea about this (I'm new to web scraping).
You can use the YouTube API; you don't need web scraping for this task.
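For example, the YouTube Data API v3 exposes channel statistics directly. A minimal sketch (the API key and channel ID are placeholders you obtain from the Google Cloud console and your channel page):
import requests

API_KEY = 'YOUR_API_KEY'        # placeholder
CHANNEL_ID = 'YOUR_CHANNEL_ID'  # placeholder

resp = requests.get(
    'https://www.googleapis.com/youtube/v3/channels',
    params={'part': 'statistics', 'id': CHANNEL_ID, 'key': API_KEY},
)
resp.raise_for_status()
stats = resp.json()['items'][0]['statistics']
print(stats['subscriberCount'], stats['viewCount'], stats['videoCount'])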
You can use the YouTube API to perform your operation. If you are still looking for a way to do it via web scraping, below is the code for it:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://accounts.google.com/signin')
driver.find_element(By.XPATH, '//*[@id="identifierId"]').send_keys('xxxxxxxx@gmail.com')
driver.find_element(By.XPATH, '//*[@id="identifierNext"]/div/button').click()
# ...enter the password into the password field here...
driver.find_element(By.ID, 'passwordNext').click()
While doing this via web scraping, after entering the email address and trying to enter the password field, you may come across an error like the one below. This can happen for several reasons, such as two-factor auth or Google not treating the automated browser as secure.
You can disable two-factor auth for your login and give it a try with web scraping; that should help.
You likely log in via a POST request, so you'll want to use a browser and log in to YouTube while monitoring the network with the browser's dev tools. If you're using Firefox, it would be this; other browsers have an equivalent. You'll want to find the form request it sends and then replicate that.
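A generic sketch of replicating such a form request with requests (the URL and field names here are hypothetical; copy the real ones from the recorded request in the network tab; note that Google's own login flow is heavily protected, which is why the API route is easier):
import requests

session = requests.Session()
# hypothetical field names: substitute the actual ones from the network tab
payload = {'email': 'you@example.com', 'password': 'secret'}
resp = session.post('https://example.com/login', data=payload)

# the Session now carries any cookies set by the login response,
# so subsequent requests through it act as the logged-in user
print(resp.status_code)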
Although, if you're that new to web scraping, you might be better off starting with something easier or using YouTube's API.
I am trying to make a tool that automates things on your website account. Some things use reCAPTCHA after you log in, so I want to know how I could use Firefox's saved cookies from my normal browser in Selenium, so that it skips the reCAPTCHA and assumes you're not a bot.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait  # available since 2.4.0
from selenium.webdriver.support import expected_conditions as EC  # available since 2.26.0

browser = webdriver.Chrome('C:/Users/xyz/Downloads/chromedriver.exe')

# Define all variables required
browser.get('http://www.erepublik.com')
xPathToSubmitButton = "//*[@id='login_form']/div[1]/p[3]/button"
urlAlerts = 'https://www.erepublik.com/en/main/messages-alerts/1'
one = 1
xPathToAlerts = "//*[@id='deleteAlertsForm']/table/tbody/tr[%d]/td[3]/p" % one

def logintoerep():
    email = browser.find_element_by_id("citizen_email")
    password = browser.find_element_by_id("citizen_password")
    email.send_keys('myemail')
    password.send_keys('mypassword')
    browser.find_element_by_xpath(xPathToSubmitButton).click()

logintoerep()
logintoerep()
The text above is code I wrote using Selenium to log in to erepublik.com.
My main goal is to verify some information on eRepublik.com whenever someone fills in a Google Form, and then complete an action based on the Google Form data. I'm trying to log in to eRepublik using Selenium, and each run of the script (which I need to run 24/7, so that the script runs whenever the form gets a new response) creates a new window; after 10-20 logins the website asks for a captcha, which Selenium can't complete. In my existing browser window I'm already logged in, so I don't have to worry about the captcha and can just run my code.
How can I bypass this problem? I need the script to be able to log in every time on its own, but the captcha won't allow that. The best solution would be to use Selenium on my existing browser window, but it doesn't allow that.
Is it possible to copy some settings from my normal browser window to the Selenium-run browser window so that it logs in automatically every time instead?
I'm open to any suggestions as long as they can get me to verify and complete a few minor actions on the website I've linked.
You can attach your Chrome profile to Selenium tests:
options = webdriver.ChromeOptions()
options.add_argument("user-data-dir=C:\\Path")  # path to your Chrome profile
browser = webdriver.Chrome(executable_path="C:\\Users\\chromedriver.exe", options=options)
First off, CAPTCHAs are meant to do exactly that: repel robots/scripts from brute-forcing or repeatedly hitting certain app features (e.g. login/register flows, sending messages, purchase flows). So you can only go around... never through.
That being said, you can simulate the logged-in state by doing one of the following:
loading the authentication cookies required for the user to be logged in (usually it's only one cookie with a token of some sort; see the sketch after this list);
loading a custom profile in the browser that already has that user logged in;
use some form of basic auth when navigating to that specific URL (if the web-app has any logic to support this);
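For the first option, a minimal sketch (the cookie name and token value are hypothetical; note that you must already be on the site's domain before add_cookie will accept the cookie):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.erepublik.com')  # must be on the domain first

# hypothetical cookie name/value, captured earlier from a logged-in session
driver.add_cookie({'name': 'session_token', 'value': 'SAVED_TOKEN'})

driver.get('https://www.erepublik.com/en/main/messages-alerts/1')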
Recommended approach: in most companies (at least in my experience), there is usually a specific cookie or flag that you can set to disable CAPTCHAs for testing purposes. If this is not the case, talk to your PM/devs to create such a feature that permits testing of your web app.
I don't want to advertise my own content, but I think I tackled this topic best HERE. Maybe it can further help.
Hope you solve the problem. Cheers!
I am currently automating a website and have a test which checks the functionality of the Remember Me option.
My test script logs in, entering a valid username and password, and checks the Remember Me checkbox before clicking the Log In button.
To test this functionality I save the cookies to file using pickle, close the browser and then reopen the browser (loading the cookies file).
def closeWebsite(self, saveCookies=False):
    if saveCookies:
        pickle.dump(self.driver.get_cookies(), open('cookies.pkl', 'wb'))
    self.driver.close()

def openWebsite(self, loadCookies=False):
    desired_caps = {}
    desired_caps['browserName'] = 'firefox'
    profile = webdriver.FirefoxProfile(firefoxProfile)
    self.driver = webdriver.Firefox(profile)
    self.driver.get(appUrl)
    if loadCookies:
        for cookie in pickle.load(open('cookies.pkl', 'rb')):
            self.driver.add_cookie(cookie)
However, when I do this, the new browser is not logged in. I understand that every time you open the browser a new session is created, and that this session ID can be obtained using driver.session_id.
Is it possible, in the openWebsite method, to load a driver and specify the session ID?
When I test this manually, the Remember Me option works as expected. I'm just having trouble understanding how Selenium handles this case.
For starters, you're loading the page before adding the cookies. Although there is the potential for them to arrive before the page needs or queries them, this isn't correct, let alone reliable.
Yet if you try to set the cookies before any page has loaded, you will get an error.
The solution seems to be this:
First of all, you need to be on the domain that the cookie will be valid for. If you are trying to preset cookies before you start interacting with a site, and your homepage is large / takes a while to load, an alternative is to find a smaller page on the site, [...]
In other words:
Navigate to your home page, or a small entry page on the same domain as appUrl (no need to wait until fully loaded).
Add your cookies.
Load appUrl. From then on you should be fine.
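Applied to the openWebsite method above, that ordering looks roughly like this (a sketch reusing the variable names from the question):
def openWebsite(self, loadCookies=False):
    profile = webdriver.FirefoxProfile(firefoxProfile)
    self.driver = webdriver.Firefox(profile)
    self.driver.get(appUrl)  # land on the cookie's domain first
    if loadCookies:
        for cookie in pickle.load(open('cookies.pkl', 'rb')):
            self.driver.add_cookie(cookie)
        self.driver.get(appUrl)  # reload now that the cookies are in place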