How to customize userAgentData(Sec-Ch-Ua) in selenium - python

https://chromedevtools.github.io/devtools-protocol/tot/Emulation/#method-canEmulate
The Emulation.setUserAgentOverride section of the DevTools Protocol documentation describes a userAgentMetadata parameter, but Python Selenium has no built-in wrapper that exposes it.
I want to customize Sec-Ch-Ua.
When I run return navigator.userAgentData in the browser, I want the result to look like this:
{'brands': [{'brand': '.Not/A)Brand', 'version': '99'}, {'brand': 'Google Chrome', 'version': '103'}, {'brand': 'Chromium', 'version': '103'}], 'mobile': False, 'platform': 'Windows'}
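A hedged sketch of one way to do this: Chromium-based drivers in Python Selenium expose execute_cdp_cmd, which forwards arbitrary parameters to a DevTools command, so the full Emulation.setUserAgentOverride payload (including userAgentMetadata) can be sent even without a dedicated wrapper. The field names below follow the CDP documentation; the brand and version strings are just illustrative values mirroring the desired output above.

```python
# Sketch: override navigator.userAgentData (and the Sec-CH-UA headers) via the
# Chrome DevTools Protocol. execute_cdp_cmd passes the parameters through as-is.

def build_ua_metadata():
    # Shape follows Emulation.setUserAgentOverride's userAgentMetadata;
    # all concrete values here are examples, not requirements.
    return {
        "brands": [
            {"brand": ".Not/A)Brand", "version": "99"},
            {"brand": "Google Chrome", "version": "103"},
            {"brand": "Chromium", "version": "103"},
        ],
        "fullVersion": "103.0.5060.134",
        "platform": "Windows",
        "platformVersion": "10.0.0",
        "architecture": "x86",
        "model": "",
        "mobile": False,
    }

def override_user_agent(driver, user_agent):
    # Only Chromium-based drivers (Chrome/Edge) provide execute_cdp_cmd.
    driver.execute_cdp_cmd(
        "Emulation.setUserAgentOverride",
        {"userAgent": user_agent, "userAgentMetadata": build_ua_metadata()},
    )
```

Calling override_user_agent before driver.get() should make navigator.userAgentData report the custom brands on pages loaded afterwards.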

Related

Add cookies to selenium

I am trying to add cookies to the Selenium web driver. I am specifying it like this:
driver.add_cookie({'domain': '.facebook.com',
                   'expiry': 1456567765,
                   'httpOnly': False,
                   'name': 'fr',
                   'path': '/',
                   'sameSite': 'None',
                   'secure': True,
                   'value': "scsvdsvbrsdvasvdsgdssdv"})
I also tried it this way:
for i in cookies:
    driver.add_cookie(i)
However, I am getting this error.
Message: invalid session id
Stacktrace:
0 chromedriver 0x000000010f17e788 chromedriver + 4515720
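A hedged note on this error: "invalid session id" usually means the browser session had already ended (driver.quit(), a crash, or a stale driver object) before add_cookie ran, and separately, Selenium only accepts a cookie after you have navigated to a page on that cookie's domain. The helper below is hypothetical (not part of Selenium); it just strips any keys add_cookie does not accept, which avoids a second class of failures when replaying saved cookies.

```python
# Keys Python Selenium's add_cookie accepts; anything else can trigger errors.
ALLOWED_KEYS = {"name", "value", "path", "domain",
                "secure", "httpOnly", "expiry", "sameSite"}

def sanitize_cookie(cookie: dict) -> dict:
    """Drop keys that Selenium's add_cookie does not accept."""
    return {k: v for k, v in cookie.items() if k in ALLOWED_KEYS}

# Intended use, with a live driver already on the cookie's domain:
# driver.get("https://www.facebook.com/")  # must match the cookie's domain
# for c in cookies:
#     driver.add_cookie(sanitize_cookie(c))
```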

How to extract hidden table data from job-postings using BeautifulSoup?

Hi, I'm doing a Python course, and for one of our assignments today we're supposed to extract the job listings on https://remoteok.com/remote-python-jobs
Here is a screenshot of the HTML in question:
[screenshot: "python jobs f12"]
And here is what I've written so far:
import requests
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code == 200:
        soup = BeautifulSoup(request.text, 'html.parser')
        table = soup.find_all('table', id="jobsboard")
        print(len(table))
        for tbody in table:
            tbody.find_all('tbody')
            print(len(tbody))
    else:
        print("can't request website")

extract("python")
print(len(table)) gives me 1 and
print(len(tbody)) gives me 131.
So it's pretty clear that I've made a mistake somewhere, but I'm having trouble identifying the cause.
One suspicion I have is that when I request the HTML text and parse it with BeautifulSoup, I am not getting the full webpage. But otherwise, I'm really not sure what I'm doing wrong here.
requests does not render a website the way a browser does; it only provides the static HTML. This site's content is generated dynamically by JavaScript, which converts embedded JSON data into the page structure.
Use this to extract your data:
[json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]
Result:
[{'#context': 'http://schema.org', '#type': 'JobPosting', 'datePosted': '2022-09-04T05:21:13+00:00', 'description': 'About the Team\n\nThe Design Infrastructure team designs, builds, and ships the Design System foundations and UI components used in all of DoorDash’s products, on all platforms. Specifically, the iOS team works closely with designers and product engineering teams across the company to help shape the Design System, and owns the shared UI library for iOS – developed for both SwiftUI and UIKit.\nAbout the Role\n\nWe are looking for a lead iOS engineer who has a strong passion for UI components and working very closely with design. As part of the role you will be leading the iOS initiative for our Design System, which will include working closely with designers and iOS engineers on product teams to align, develop, maintain, and evolve the library of foundations and UI components; which is adopted in all our products.\n\nYou will report into the Lead Design Technologist for Mobile on our Design Infrastructure team in our Product Design organization. 
This role is 100% flexible, and can b\n Apply now and work remotely at DoorDash', 'baseSalary': {'#type': 'MonetaryAmount', 'currency': 'USD', 'value': {'#type': 'QuantitativeValue', 'minValue': 70000, 'maxValue': 120000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'jobLocation': [{'address': {'#type': 'PostalAddress', 'addressCountry': 'United States', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}], 'applicantLocationRequirements': [{'#type': 'Country', 'name': 'United States'}], 'title': 'Lead Design Technologist iOS', 'image': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png', 'occupationalCategory': 'Lead Design Technologist iOS', 'workHours': 'Flexible', 'validThrough': '2022-12-03T05:21:13+00:00', 'hiringOrganization': {'#type': 'Organization', 'name': 'DoorDash', 'url': 'https://remoteok.com/doordash', 'sameAs': 'https://remoteok.com/doordash', 'logo': {'#type': 'ImageObject', 'url': 'https://remoteok.com/assets/img/jobs/f2f1ab68227768717536a0ab7e2578ab1662268873.png'}}}, {'#context': 'http://schema.org', '#type': 'JobPosting', 'datePosted': '2022-09-03T00:00:09+00:00', 'description': "We’re seeking a senior core, distributed systems engineers to build dev tools. At [Iterative](https://iterative.ai) we build [DVC](https://dvc.org) (9000+ ⭐on GitHub) and [CML](https://cml.dev) (2000+ ⭐ on GitHub) and a few other projects that are not released yet. It's a great opportunity if you love open source, dev tools, systems programming, and remote work. 
Join our well-funded remote-first team to build developer tools to see how your code is used by thousands of developers every day!\n\nABOUT YOU\n\n- Excellent communication skills and a positive mindset 🤗\n- No prior deep knowledge of ML is required\n- At least one year of experience with file systems, concurrency, multithreading, and server architectures\n- Passionate about building highly reliable system software\n- Python knowledge and excellent coding culture (standards, unit test, docs, etc) are required.\n- Initiative to help shape the engineering practices, products, and culture of a young startup\n- R\n Apply now and work remotely at Iterative", 'baseSalary': {'#type': 'MonetaryAmount', 'currency': 'USD', 'value': {'#type': 'QuantitativeValue', 'minValue': 50000, 'maxValue': 180000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'applicantLocationRequirements': {'#type': 'Country', 'name': 'Anywhere'}, 'jobLocation': {'address': {'#type': 'PostalAddress', 'addressCountry': 'Anywhere', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}, 'title': 'Senior Software Engineer', 'image': 'https://remoteOK.com/assets/img/jobs/cb9a279f231a5312283e6d935bba3be91636086324.png', 'occupationalCategory': 'Senior Software Engineer', 'workHours': 'Flexible', 'validThrough': '2022-12-02T00:00:09+00:00', 'hiringOrganization': {'#type': 'Organization', 'name': 'Iterative', 'url': 'https://remoteok.com/iterative', 'sameAs': 'https://remoteok.com/iterative', 'logo': {'#type': 'ImageObject', 'url': 'https://remoteOK.com/assets/img/jobs/cb9a279f231a5312283e6d935bba3be91636086324.png'}}}, {'#context': 'http://schema.org', '#type': 'JobPosting', 'datePosted': '2022-09-06T09:10:04+00:00', 'description': '<p dir="ltr">Get a remote job that you will love with better compensation and career 
growth.<strong></strong></p><p dir="ltr">We’re Lemon.io — a marketplace where we match you with hand-picked startups from the US and Europe.\xa0<strong></strong></p><p dir="ltr"><br /></p><p dir="ltr"><strong>Why work with us:</strong></p><ul><li dir="ltr"><p dir="ltr">We’ll find you a team that respects you. No time-trackers or any micromanagement stuff</p></li><li dir="ltr"><p dir="ltr">Our engineers earn $5k - $9k / month. We’ve already paid out over $10M.</p></li><li dir="ltr"><p dir="ltr">Choose your schedule. We have both full- and part-time projects.</p></li><li dir="ltr"><p dir="ltr">No project managers in the middle — only direct communications with clients, most of whom have a technical background</p></li><li dir="ltr"><p dir="ltr">Our customer success team provides life support to help you resolve anything.</p></li><li dir="ltr"><p dir="ltr">You don’\n Apply now and work remotely at lemon.io', 'baseSalary': {'#type': 'MonetaryAmount', 'currency': 'USD', 'value': {'#type': 'QuantitativeValue', 'minValue': 60000, 'maxValue': 110000, 'unitText': 'YEAR'}}, 'employmentType': 'FULL_TIME', 'directApply': 'http://schema.org/False', 'industry': 'Startups', 'jobLocationType': 'TELECOMMUTE', 'applicantLocationRequirements': {'#type': 'Country', 'name': 'Anywhere'}, 'jobLocation': {'address': {'#type': 'PostalAddress', 'addressCountry': 'Anywhere', 'addressRegion': 'Anywhere', 'streetAddress': 'Anywhere', 'postalCode': 'Anywhere', 'addressLocality': 'Anywhere'}}, 'title': 'DevOps Engineer', 'image': 'https://remoteOK.com/assets/img/jobs/b31a9584a903e655bd2f67a2d7f584781662455404.png', 'occupationalCategory': 'DevOps Engineer', 'workHours': 'Flexible', 'validThrough': '2022-12-05T09:10:04+00:00', 'hiringOrganization': {'#type': 'Organization', 'name': 'lemon.io', 'url': 'https://remoteok.com/lemon-io', 'sameAs': 'https://remoteok.com/lemon-io', 'logo': {'#type': 'ImageObject', 'url': 
'https://remoteOK.com/assets/img/jobs/b31a9584a903e655bd2f67a2d7f584781662455404.png'}}},...]
Example
import requests, json
from bs4 import BeautifulSoup

def extract(term):
    url = f"https://remoteok.com/remote-{term}-jobs"
    request = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
    if request.status_code != 200:
        print("can't request website")
        return []
    soup = BeautifulSoup(request.text, 'html.parser')
    # each job row carries a <script type="application/ld+json"> block with the posting data
    return [json.loads(e.text.strip()) for e in soup.select('table tr.job [type="application/ld+json"]')]

for post in extract("python"):
    print(post['hiringOrganization']['name'])
Output
DoorDash
Iterative
lemon.io
Angaza
Angaza
Kandji
Kandji
Kandji
Kandji
Kandji
Great Minds
Jobber
Udacity
...
tbody does appear on the web page but isn't pulled into the table variable by BeautifulSoup.
I encountered this before; the solution is to get your tags directly from Selenium.
But there is only one jobsboard table and one tbody on the web page, so you could skip tbody and look for a more useful tag.
I use Google Chrome. It has the free extension ChroPath, which makes it easy to identify selectors. I just right-click on text in a browser and select Inspect (sometimes twice), and the correct HTML tag is highlighted.
PyCharm allows you to view the contents of each variable with ease.
This code will let you save the web page's HTML source to a text file for inspection:
outputFile = r"C:\Users\user\Documents\HP Laptop\Documents\Documents\Jobs\DIT\IDMB\OutputZ.txt"

def update_output_file(pageSource: str):
    # the with-statement closes the file automatically
    with open(outputFile, 'w', encoding='utf-8') as f:
        f.write(pageSource)

get cookies for www subdomain, or a particular domain?

I'm calling get_cookies() on my selenium web driver. Of course we know this fetches the cookies for the current domain. However, many popular sites set cookies on both example.com and www.example.com.
Technically, it's not really a "separate domain". I think nearly every website on the internet serves the same site at the www subdomain as at the root.
So is it still impossible to save cookies for both hosts, since one is a subdomain? I know the answer is complicated if you want to save cookies for all domains, but I figured this is kind of different, since they really are the same domain.
Replicate it with this code:
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("https://www.instagram.com/")
print(driver.get_cookies())
output:
[{'name': 'ig_did', 'value': 'F5FDFBB0-7D13-4E4E-A100-C627BD1998B7', 'path': '/', 'domain': '.instagram.com', 'secure': True, 'httpOnly': True, 'expiry': 1671083433}, {'name': 'mid', 'value': 'X9hOqQAEAAFWnsZg8-PeYdGqVcTU', 'path': '/', 'domain': '.instagram.com', 'secure': True, 'httpOnly': False, 'expiry': 1671083433}, {'name': 'ig_nrcb', 'value': '1', 'path': '/', 'domain': '.instagram.com', 'secure': True, 'httpOnly': False, 'expiry': 1639547433}, {'name': 'csrftoken', 'value': 'Yy8Bew6500BinlUcAK232m7xPnhOuN4Q', 'path': '/', 'domain': '.instagram.com', 'secure': True, 'httpOnly': False, 'expiry': 1639461034}]
Then load the page in a fresh browser instance and check yourself. You'll see www is there.
The cookies for the main domain look fine, though.
My idea is to use the requests library and fetch all the cookies with a plain HTTP request:
import requests
# Making a get request
response = requests.get('https://www.instagram.com/')
# printing request cookies
print(response.cookies)
Domain
To host an application on the internet, you need a domain name. Domain names act as a placeholder for the complex string of numbers known as an IP address. As an example:
https://www.instagram.com/
With the latest Firefox v84.0, when accessing the Instagram application, the following cookies are observed within the https://www.instagram.com domain:
Subdomain
A subdomain is an add-on to your primary domain name. For example, when using a site such as Craigslist, you are always on a subdomain like reno.craigslist.org or sfbay.craigslist.org, and you are automatically forwarded to the subdomain that corresponds to your physical location. Essentially, a subdomain is a separate part of your website that operates under the same primary domain name.
Reusing cookies
If you have stored cookies from the domain example.com, those stored cookies can't be pushed through the WebDriver session to any different domain, e.g. example.edu; they can be used only within example.com. Further, to log a user in automatically in the future, you need to store the cookies only once, at the point when the user has logged in. Before adding the cookies back, you need to browse to the same domain the cookies were collected from.
Demonstration
As an example, you can store the cookies once the user has logged in to an application as follows:
from selenium import webdriver
import pickle

driver = webdriver.Chrome()
driver.get('http://demo.guru99.com/test/cookie/selenium_aut.php')
driver.find_element_by_name("username").send_keys("abc123")
driver.find_element_by_name("password").send_keys("123xyz")
driver.find_element_by_name("submit").click()
# storing the cookies
pickle.dump(driver.get_cookies(), open("cookies.pkl", "wb"))
driver.quit()
Later, at any point in time, if you want the user logged in automatically, you need to browse to the same domain/url first and then add the cookies as follows:
from selenium import webdriver
import pickle

driver = webdriver.Chrome()
driver.get('http://demo.guru99.com/test/cookie/selenium_aut.php')
# loading the stored cookies
cookies = pickle.load(open("cookies.pkl", "rb"))
for cookie in cookies:
    # adding the cookies to the session through the webdriver instance
    driver.add_cookie(cookie)
driver.get('http://demo.guru99.com/test/cookie/selenium_cookie.php')
Reference
You can find a detailed discussion in:
org.openqa.selenium.InvalidCookieDomainException: Document is cookie-averse using Selenium and WebDriver

How to keep browsers open after console is closed

I am using selenium to do a bit of automation, but I would like to be able to keep the browser windows open even after the python console has been closed.
Here are my current settings for the webdriver:
capabilities = {
    'browserName': 'chrome',
    'version': '',
    'platform': 'ANY',
    'javascriptEnabled': True,
    'chromeOptions': {
        'useAutomationExtension': False,
        'forceDevToolsScreenshot': True,
        'detach': False,
        'args': ['--start-maximized', '--disable-infobars', '--log-level=3']
    }
}
driver = webdriver.Chrome(desired_capabilities=capabilities)
Does anyone know how I can achieve this? Thanks.
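A hedged sketch: in the capabilities above, 'detach': False is exactly the switch that closes Chrome when the driver process exits; flipping it to True keeps the window open after the script ends. In Selenium 4 the idiomatic route is options.add_experimental_option("detach", True). The helper below (a name introduced here, not a Selenium API) just builds the dict form; the live-driver wiring is left as comments since it needs a local Chrome install.

```python
# Sketch: the chromeOptions 'detach' flag controls whether Chrome closes
# along with the driver process.

def detached_chrome_options() -> dict:
    return {
        'useAutomationExtension': False,
        'forceDevToolsScreenshot': True,
        'detach': True,  # keep the browser alive after the script/console closes
        'args': ['--start-maximized', '--disable-infobars', '--log-level=3'],
    }

# The same idea with Selenium 4 ChromeOptions (commented out; needs Chrome):
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# options.add_experimental_option("detach", True)
# driver = webdriver.Chrome(options=options)
```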

Selenium Add Cookies From CookieJar

I am trying to add python requests session cookies to my selenium webdriver.
I have tried this so far
for c in self.s.cookies:
    driver.add_cookie({'name': c.name, 'value': c.value, 'path': c.path, 'expiry': c.expires})
This code works fine with PhantomJS, but not with Firefox or Chrome.
My questions:
Is there some special way of iterating over the cookiejar for Firefox and Chrome?
Why does it work with PhantomJS?
for cookie in s.cookies:  # session cookies
    # Setting domain to None automatically instructs most webdrivers to use the
    # domain of the current window handle
    cookie_dict = {'domain': None, 'name': cookie.name, 'value': cookie.value, 'secure': cookie.secure}
    if cookie.expires:
        cookie_dict['expiry'] = cookie.expires
    if cookie.path_specified:
        cookie_dict['path'] = cookie.path
    driver.add_cookie(cookie_dict)
Check this for a complete solution: https://github.com/cryzed/Selenium-Requests/blob/master/seleniumrequests/request.py
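As a runnable variant of the loop above: a requests session's .cookies is an http.cookiejar-style jar of stdlib Cookie objects, so the conversion can be packaged as a small helper. to_selenium_cookie is a name introduced here for illustration, not a Selenium or requests API.

```python
from http.cookiejar import Cookie

def to_selenium_cookie(cookie) -> dict:
    """Convert an http.cookiejar Cookie (the type requests stores) into the
    dict shape Selenium's add_cookie expects."""
    d = {'name': cookie.name, 'value': cookie.value, 'secure': bool(cookie.secure)}
    if cookie.expires:
        d['expiry'] = cookie.expires
    if cookie.path_specified:
        d['path'] = cookie.path
    return d

# Intended use, with a live driver already on the matching domain:
# for c in session.cookies:
#     driver.add_cookie(to_selenium_cookie(c))
```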
