I wondered if it is possible to query multiple target URLs at the same time in Python Selenium. My current code looks like the below, which just calls one URL:
Target = 'https://www.skadden.com/professionals?skip=0&position=64063c2e-9576-4f66-91fa-166f4bede9b8&hassearched=true'
My body of code currently works and brings back the data I require. Am I able to call multiple URLs at the same time? For example,
Target = ['https://www.skadden.com/professionals?skip=0&position=64063c2e-9576-4f66-91fa-166f4bede9b8&hassearched=true','https://www.skadden.com/professionals?skip=0&position=f1c78f66-a0a6-45e5-8d22-541bd65bb325&hassearched=true']
I have tried this code and it doesn't work. Any ideas?
Thanks
Chris
To do this you need to either start two separate browsers (i.e. request a new session twice) or open two separate tabs, as described here: How to open a new tab using Selenium WebDriver with Java?
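For the common case you don't even need both pages open at once: a single driver can simply visit the URLs one after another. A minimal sketch, assuming your existing extraction logic can be wrapped into a function (scrape_page below is a hypothetical placeholder for your current body of code):

from selenium import webdriver

targets = [
    'https://www.skadden.com/professionals?skip=0&position=64063c2e-9576-4f66-91fa-166f4bede9b8&hassearched=true',
    'https://www.skadden.com/professionals?skip=0&position=f1c78f66-a0a6-45e5-8d22-541bd65bb325&hassearched=true',
]

driver = webdriver.Chrome()
for url in targets:
    driver.get(url)       # load each target in the same session, one after another
    scrape_page(driver)   # hypothetical: your existing extraction code goes here
driver.quit()

If you genuinely need the pages open simultaneously, create two webdriver instances (two sessions) instead of one.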
I'm new to web scraping and I wanted to retrieve all the wins and losses within this season of the NHL. Now this URL works fine: https://www.nhl.com/scores ... but the problem arises when I want to go back to previous dates like so: https://www.nhl.com/scores/2022-09-24 ... this is the URL that shows up when I interact with the buttons on that first page; you can see for yourself. I know I'm missing something here, but to me it's not obvious. Please enlighten me.
I then tried to see if there was a way to use https://www.nhl.com/scores/ to obtain the information I require, but I am having trouble accessing that data.
I'd recommend not using the URLs to access a specific date's data. Instead, take a look at the Fetch/XHR requests in your browser's dev tools -> Network activity to see what kinds of API calls are being made whenever you click on a date. Then you can make a call directly to that API endpoint within your Python script and parse the JSON response.
You can use the requests library for this: https://requests.readthedocs.io/en/latest/
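A minimal sketch of that approach; the endpoint below is only a placeholder, since the real URL is whatever you actually see in the Network tab when you click a date:

import requests

# Placeholder endpoint: substitute the actual API URL from your browser's Network tab
api_url = 'https://www.nhl.com/api/v1/scores?date=2022-09-24'

response = requests.get(api_url)
response.raise_for_status()    # fail loudly on HTTP errors
data = response.json()         # the API returns JSON, already structured
print(data)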
I am new to Selenium WebDriver, Python, and pytest. I need suggestions on how to write a test function properly, and advice on the best way to structure a test case. Please show me a professional way to write one.
I think you first have to decide whether a single session is controlled by one function or by several, and whether you want multiple browser instances or not.
When it comes to Selenium, you initiate a browser, control it, then close it.
If you fail to do it in this order, you end up with a ton of chrome.exe instances in Task Manager.
A good first test case is initiating a browser, opening Google, typing a word, hitting search, and then saving the contents of the page to a variable (a sketch follows below).
If your intention is web scraping, definitely get yourself a copy of
BeautifulSoup (pip install bs4):
from bs4 import BeautifulSoup as bs
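Putting those pieces together, a minimal pytest sketch, assuming a recent Selenium and Chrome installed; the fixture guarantees the initiate/control/close order, so no stray chrome.exe instances pile up:

import pytest
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

@pytest.fixture
def driver():
    d = webdriver.Chrome()   # initiate the browser...
    yield d                  # ...hand it to the test...
    d.quit()                 # ...and always close it, even if the test fails

def test_google_search(driver):
    driver.get('https://www.google.com')
    box = driver.find_element(By.NAME, 'q')   # Google's search box is named 'q'
    box.send_keys('selenium', Keys.RETURN)    # type a word and hit search
    contents = driver.page_source             # save the page contents to a variable
    soup = BeautifulSoup(contents, 'html.parser')
    assert soup.title is not None             # the results page should have a title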
I am working on a scraper built in RSelenium. A number of tasks are more easily accomplished using Python, so I've set up a .Rmd file with access to R and Python code chunks.
The R side of the scraper opens a website in Chrome, logs in, and accesses and scrapes various pages behind the login wall. (This is being done with the permission of the website owners, who would rather have users scrape the data themselves than put together a downloadable archive.)
I also need to download files from these pages, a task I keep attempting in RSelenium but repeatedly come back to Python solutions for.
I don't want to take the time to rewrite the code in Python, as it's fairly robust, but my attempts to use Python result in opening a new driver, which starts a new session that is no longer logged in. Is there a way for Python code chunks to access an existing driver/session being driven by RSelenium?
(I will open a separate question with my RSelenium download issues if this solution doesn't pan out.)
As far as I can tell, and with help from user Jortega, Selenium does not support interacting with already-open browsers, and Python cannot access an existing session created via R.
My solution has been to rewrite the scraper using Python.
I am on Windows 8.1, Python 3.6.
Is it possible to get all currently open websites in the latest version of Chrome and save them to a text file in D:/?
I tried opening this file:
C:\Users\username\AppData\Local\Google\Chrome\User Data\Default\Current Tabs
But I receive an error saying that the file is open in another program.
There is another file named History that contains the URLs that were opened, but it also contains characters like NULL.
I tried reading the file in Python but received a UnicodeDecodeError (not sure about the exact name).
Then I tried opening the file with the following code:
with open('C:/Users/username/AppData/Local/Google/Chrome/User Data/Default/History', "r+", encoding='latin') as file:
    data = file.read()
    print(data)
And it worked, but I only got 1 or 2 URLs, while in the text file there were no URLs at all.
Maybe there's another way, something like importing a module:
import chrome
url = chrome.get_url()
print(url)
Maybe Selenium can also do this, but I don't know how.
Maybe there's another way to read the file with all the links in Python.
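For what it's worth, the History file is a SQLite database rather than plain text, which is why a raw text read shows NULL bytes and raises decode errors. A sketch of reading it properly; note it lists browsing history, not the currently open tabs, and the file must be copied first because Chrome keeps the live one locked:

import shutil
import sqlite3

src = r'C:\Users\username\AppData\Local\Google\Chrome\User Data\Default\History'
shutil.copy2(src, 'History_copy')   # work on a copy; Chrome locks the live file

conn = sqlite3.connect('History_copy')
for (url,) in conn.execute(
        'SELECT url FROM urls ORDER BY last_visit_time DESC LIMIT 20'):
    print(url)                      # the 20 most recently visited URLs
conn.close()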
What I want with it is to detect which websites are open; if mywebsite.com has been open for more than 10 minutes, it will automatically be blocked. The system has its own hosts file:
C:\Windows\System32\drivers\etc\hosts
It will add the following at the end:
127.0.0.1 www.mywebsite.com
And the website will no longer be available to use.
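The hosts-file step itself is a one-liner in Python; a sketch (the script has to run with administrator rights to be able to write to that file):

HOSTS = r'C:\Windows\System32\drivers\etc\hosts'

def block(domain):
    # Append a loopback entry so the domain resolves to this machine
    with open(HOSTS, 'a') as f:
        f.write('\n127.0.0.1 ' + domain + '\n')

block('www.mywebsite.com')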
You can use this methodology to store the tab data and manipulate it:
windows = driver.window_handles
You can store the windows using the above method.
current_window = driver.current_window_handle
This method gives you the window that is currently being handled. You can go through the list windows and check which one is current_window to navigate between the tabs.
driver.switch_to.window(windows[5])
This line will switch to the desired tab, but I assume you already have that part.
Now how do you store the time spent after the tabs are opened?
There are two ways to do it:
Internally, by referring to a pandas dataframe or list
Reading and writing to a file.
First you need to import the time library inside the script:
current_time = time.time()
current_time is a float representing the current time as a Unix timestamp.
In either one of these scenarios, you will need a structure such as this:
data = []
for i in range(len(windows)):
    data.append([windows[i], time.time()])
This will give a structure like the below:
[[windows[0], 1234564879],
 [windows[1], 1234567896], ...]
Here's the piece you're missing:
for i in range(len(data)):
    if time.time() - data[i][1] > 600:  # if the new timestamp minus the stored one exceeds 600 seconds
        driver.switch_to.window(data[i][0])
        driver.close()
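Putting it all together, a rough sketch of the polling loop; it assumes driver is a live WebDriver instance, and it uses a dict rather than a list so each handle keeps a single first-seen timestamp:

import time

open_times = {}    # window handle -> timestamp when the tab was first seen

while True:
    handles = driver.window_handles
    for handle in handles:
        open_times.setdefault(handle, time.time())   # record new tabs once
    for handle, opened in list(open_times.items()):
        if handle in handles and time.time() - opened > 600:
            driver.switch_to.window(handle)   # focus the stale tab
            driver.close()                    # close it after 10 minutes
            del open_times[handle]
    if driver.window_handles:
        driver.switch_to.window(driver.window_handles[0])  # refocus a live tab
    time.sleep(5)                                          # poll every few seconds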
My personal advice is to start with stable API services to get whatever data you want, rather than Selenium. I would recommend SerpApi, since I work there. It has a variety of scrapers, including a Google results scraper, and it offers 5,000 free calls for new accounts.
I'm not sure how to find this information. I have found a few tutorials so far about using Python with Selenium, but none has so much as touched on this. I am able to run some basic test scripts through Python that automate Selenium, but it just shows the browser window for a few seconds and then closes it. I need to get the browser output into a string/variable (ideally), or at least save it to a file, so that Python can do other things with it (parse it, etc.). I would appreciate it if anyone could point me towards resources on how to do this. Thanks
Using Selenium WebDriver and Python, you simply access the .page_source property to get the source of the current page.
For example, using the Firefox() driver:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.example.com/')
print(driver.page_source)
driver.quit()
There's a Selenium.getHtmlSource() method in Java; most likely it is also available in Python. It returns the source of the current page as a string, so you can do whatever you want with it.
OK, so here is how I ended up doing this, for anyone who needs it in the future.
You have to use Firefox for this to work.
1) Create a new Firefox profile (not strictly necessary, but ideal to keep this separate from normal Firefox usage); there is plenty of info on how to do this on Google, and the steps depend on your OS.
2) Get the Firefox plugin: https://addons.mozilla.org/en-US/firefox/addon/2704/ (it automatically saves all pages for a given domain name); you need to configure it to save whichever domains you intend on auto-saving.
3) Then just start the Selenium server using the profile you created (below is an example for Linux):
cd /root/Downloads/selenium-remote-control-1.0.3/selenium-server-1.0.3
java -jar selenium-server.jar -firefoxProfileTemplate /path_to_your_firefox_profile/
That's it. It will now save all the pages for a given domain whenever Selenium visits them. Selenium does create a bunch of garbage pages too, which you could delete with some simple regex parsing; from there, it's up to you how to manipulate the saved pages.