This is the first time, I am posting a question. Please forgive me if I do something incorrect.
I am trying to create a python-selenium script to get the source code of MULTIPLE web pages.
I am running the script in the following manner (via command line on windows 7)
python program.py < input.txt > output.htm
This does creates the result, however since I am using a loop function it is appending the same file with all the results.
Is there a way, I can create a NEW FILE FOR EACH result/print
Thanks in advance.
Here is my code,
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
path_to_chromedriver = '/Users/office/Desktop/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
while(True):
url = raw_input("")
url2 = raw_input("")
browser.get(url)
time.sleep(10)
browser.get(url2)
time.sleep(10)
element_to_hover_over = browser.find_element_by_xpath('//*[#id="personSummaryTable"]/tbody/tr/td[2]/div[5]/div/span[1]/a')
hover = ActionChains(browser).move_to_element(element_to_hover_over)
hover.perform()
time.sleep(5)
stuff = browser.page_source.encode('ascii', 'ignore')
print stuff
Jan's idea worked great,
All it needed was to let python decide a filename,
Thanks Jan
import datetime
suffix = ".html"
basename = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
fname = "_".join([basename, suffix]) # e.g. 'mylogfile_120508_171442'
print fname
with open(fname, "w") as f:
f.write(stuff)
Welcome at SO
You have two options
Let Python code decide on name of output file
This will likely be based on current time, e.g.
import time
# here get somehow your page content
page_content = ?????
prefix = "outut-"
suffix = ".html"
fname = "{prefix}{now:d}{suffix}".format(now=time.time())
print fname
with open(fname, "w") as f:
f.write(page_content)
Let your external loop (out of Python) create the file name
This file name can be e.g. on Linux created by some form of date command.
Related
I am using the python webbrowser module to try and open a html file. I added a short thing to get code from a website to view, allowing me to store a web-page incase I ever need to view it without wifi, for instance a news article or something else.
The code itself is fairly short so far, so here it is:
import requests as req
from bs4 import BeautifulSoup as bs
import webbrowser
import re
webcheck = re.compile('^(https?:\/\/)?(www.)?([a-z0-9]+\.[a-z]+)([\/a-zA-Z0-9#\-_]+\/?)*$')
#Valid URL Check
while True:
url = input('URL (MUST HAVE HTTP://): ')
check = webcheck.search(url)
groups = list(check.groups())
if check != None:
for group in groups:
if group == 'https://':
groups.remove(group)
elif group.count('/') > 0:
groups.append(group.replace('/', '--'))
groups.remove(group)
filename = ''.join(groups) + '.html'
break
#Getting Website Data
reply = req.get(url)
soup = bs(reply.text, 'html.parser')
#Writing Website
with open(filename, 'w') as file:
file.write(reply.text)
#Open Website
webbrowser.open(filename)
webbrowser.open('https://www.youtube.com')
I added webbrowser.open('https://www.youtube.com') so that I knew the module was working, which it was, as it did open up youtube.
However, webbrowser.open(filename) doesn't do anything, yet it returns True if I define it as a variable and print it.
The html file itself has a period in the name, but I don't think that should matter as I have made a file without it as the name and it wont run.
Does webbrowser need special permissions to work?
I'm not sure what to do as I've removed characters from the filename and even showed that the module is working by opening youtube.
What can I do to fix this?
From the webbrowser documentation:
Note that on some platforms, trying to open a filename using this function, may work and start the operating system’s associated program. However, this is neither supported nor portable.
So it seems that webbrowser can't do what you want. Why did you expect that it would?
adding file:// + full path name does the trick for any wondering
I wrote a short program to automate the process of clicking and saving profiles on LinkedIn.
Brief:
The program reads from a txt file with a large amount of LI URLs.
Using Selenium, it opens them one by one, then, hit the "Open in Sales Navigator" button
A new tab is opening, and on it, it needs to click the "Save" button, and choose the relevant list to save on.
I have two main problems:
LinkedIn has 3 versions of the same page. How can I use a condition to check which page version is it? (meaning - if you can't find this button, move to the next version). From what I've seen, you can't really use the "If" function with selenium, cause it causing trouble. Any other suggestions?
More important, and the reason I opened this thread - I want to monitor the "failed" links. Let's say I have a list of 1000 LI URLs, and I ran the program to save them on my account. I want to monitor the ones it didn't save or failed to open (broken links, page unavailable, etc.). In order to execute that, I used a CSV file and ordered the program to save all the pages that already saved on this account, but it doesn't solve my problem. How can I make him save all of them and not just the ones that were already saved? (I find it hard to execute because when a page appears as "Unavailable", it jumps to the next one and I couldn't find a way to make him save it.
It makes it harder to work with it, cause when I put 500 or 1000 URLs, I can't tell which ones save and which ones aren't saved.
Here's the code:
import selenium.webdriver as webdriver
import selenium.webdriver.support.ui as ui
from selenium.webdriver.common.keys import Keys
from time import sleep
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException
import csv
import random
options = webdriver.ChromeOptions()
options.add_argument('--lang=EN')
options.add_argument("--start-maximized")
prefs = {"profile.default_content_setting_values.notifications" : 2}
options.add_experimental_option("prefs",prefs)
driver = webdriver.Chrome(executable_path='assets\chromedriver', chrome_options=options)
driver.get("https://www.linkedin.com/login?fromSignIn=true")
minDelay=input("\n Please provide min delay in seconds : ")
maxDelay=input("\n Please provide max delay in seconds : ")
listNumber=input("\n Please provide list number : ")
outputFile=input('\n save skipped as?: ')
count=0
closed=2
with open("links.txt", "r") as links:
for link in links:
try:
driver.get(link.strip())
sleep(3)
driver.find_element_by_xpath("//button[#class='save-to-list-dropdown__trigger ph5 artdeco-button artdeco-button--primary artdeco-button--3 artdeco-button--pro artdeco-dropdown__trigger artdeco-dropdown__trigger--placement-bottom ember-view']").click()
sleep(2)
count+=1
if count==1:
driver.find_element_by_xpath("//ul[#class='save-to-list-dropdown__content']//ul//li["+str(listNumber)+"]").click()
else:
driver.find_element_by_xpath("//ul[#class='save-to-list-dropdown__content']//ul//li[1]").click()
sleep(2)
sleep(random.randint(int(minDelay), int(maxDelay)))
except:
if closed==0:
driver.close()
sleep(1)
fileOutput=open(outputFile+".csv", mode='a', newline='', encoding='utf-8')
file_writer = csv.writer(fileOutput, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
file_writer.writerow([link.strip()])
fileOutput.close()
print("Finished.")
The common approach to have different sort of listeners is to use EventFiringWebDriver. See the example here:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver
class EventListener(AbstractEventListener):
def before_click(self, element, driver):
if element.tag_name == 'a':
print('Clicking link:', element.get_attribute('href'))
if __name__ == '__main__':
driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
driver.get("https://webelement.click/en/welcome")
link = driver.find_element_by_xpath('//a[text()="All Posts"]')
link.click()
driver.quit()
UPD:
Basically your case does not really need that listener. However you can user it. Say you have link file like:
https://google.com
https://invalid.url
https://duckduckgo.com/
https://sadfsdf.sdf
https://stackoverflow.com
Then the way with EventFiringWebDriver would be:
from selenium import webdriver
from selenium.webdriver.support.abstract_event_listener import AbstractEventListener
from selenium.webdriver.support.event_firing_webdriver import EventFiringWebDriver
broken_urls = []
class EventListener(AbstractEventListener):
def on_exception(self, exception, drv):
broken_urls.append(drv.current_url)
if __name__ == '__main__':
driver = EventFiringWebDriver(driver=webdriver.Firefox(), event_listener=EventListener())
with open("links.txt", "r") as links:
for link in links:
try:
driver.get(link.strip())
except:
print('Cannot reach the link', link.strip())
print("Finished.")
driver.quit()
import csv
with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
wr.writerow(broken_urls)
and without EventFiringWebDriver would be:
broken_urls = []
if __name__ == '__main__':
from selenium import webdriver
driver = webdriver.Firefox()
with open("links.txt", "r") as links:
for link in links:
stripped_link = link.strip()
try:
driver.get(stripped_link)
except:
print('Cannot reach the link', link.strip())
broken_urls.append(stripped_link)
print("Finished.")
driver.quit()
import csv
with open('broken_urls.csv', 'w', newline='') as broken_urls_csv:
wr = csv.writer(broken_urls_csv, quoting=csv.QUOTE_ALL)
wr.writerow(broken_urls)
I am using the following script
from selenium import webdriver
import time
import urllib.parse
browser = webdriver.Chrome()
with open("google-search-terms.adoc") as fin:
for line_no, line in enumerate(fin):
line = line.strip()
query = urllib.parse.urlencode({'q': line})
browser.execute_script(
f"window.open('https://www.google.com/search?{query}');")
for x in range(len(browser.window_handles)):
browser.switch_to.window(browser.window_handles[x])
time.sleep(3)
try:
browser.find_elements_by_xpath(
"//*[#id='rso']/div/div/div/a/div/cite[contains(text(),'amazon')]").click()
except:
pass
The input file google-search-terms.adoc contains:
The Effective Executive by Peter Drucker
The Functions of the Executive
It open multiple tabs containing search result of the texts from input file.
It loop over the tabs every 3 seconds.
However, it is not clicking the expected search results?
What is wrong here?
Google has a feature where you can get the result from a particular website. So the procedure here is just to search via that feature and click the first link found:
from selenium import webdriver
import time
import urllib.parse
browser = webdriver.Chrome()
with open("google-search-terms.adoc") as fin:
for line_no, line in enumerate(fin):
line = line.strip()
query = urllib.parse.urlencode({'q': line + " site:amazon.com"})
browser.execute_script(
f"window.open('https://www.google.com/search?{query}');")
for x in range(len(browser.window_handles)):
browser.switch_to.window(browser.window_handles[x])
time.sleep(2)
try:
result = browser.find_elements_by_xpath('//div[#id="rso"]/div/div')[0]
result.find_element_by_xpath("./div/a").click()
except:
continue
I am trying to open a .csv file, and open link in .csv file with selenium, and loop through links in .csv file. I am new to Selenium . I can easily do it in beautiful soup.Can you please guide me through right direction.
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import csv
import requests
contents =[]
filename = 'link_business_filter.csv'
def copy_json():
with open('vendors_info_bangkok.json',"a") as wt:
for x in script3:
wt.write(x)
wt.close()
return
with open(filename,'rt') as f:
data = csv.reader(f)
for row in data:
links = row[0]
contents.append(links)
for link in contents:
url_html = requests.get(link)
browser = webdriver.Chrome('chromedriver')
for link_loop in url_html:
open = browser.get(link_loop)
source = browser.page_source
data = bs(source,"html.parser")
body = data.find('body')
script = body
x_path = '//*[#id="react-root"]/section/main/div'
script2 = browser.find_element_by_xpath(x_path)
script3 = script2.text
print(script3)
copy_json()
First install selenium:
pip install selenium
Then according to your os install chromediver then test it by going to folder you have kept the driver and open terminal and type chromedriver, if there's no error then it works.
Then in your code you need to provide executable_path for the chromdriver
In you Code:
....code...
for link in contents:
url_html = requests.get(link)
path to chromdriver = 'C:/Users/chromedriver.exe' #<-- you can keep this file anywhere you wish
browser = webdriver.Chrome(executable_path= 'path_to_chromdriver') #<-- you can also give the path directly here
for link_loop in url_html:
...code...
I am trying to use Selenium in Python to save webpages on MacOS Firefox.
So far, I have managed to click COMMAND + S to pop up the SAVE AS window. However,
I don't know how to:
change the directory of the file,
change the name of the
file, and
click the SAVE AS button.
Could someone help?
Below is the code I have use to click COMMAND + S:
ActionChains(browser).key_down(Keys.COMMAND).send_keys("s").key_up(Keys.COMMAND).perform()
Besides, the reason for me to use this method is that I encounter Unicode Encode Error when I :-
write the page_source to a html file and
store scrapped information to a csv file.
Write to a html file:
file_object = open(completeName, "w")
html = browser.page_source
file_object.write(html)
file_object.close()
Write to a csv file:
csv_file_write.writerow(to_write)
Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
position 1: ordinal not in range(128)
with open('page.html', 'w') as f:
f.write(driver.page_source)
What you are trying to achieve is impossible to do with Selenium. The dialog that opens is not something Selenium can interact with.
The closes thing you could do is collect the page_source which gives you the entire HTML of a single page and save this to a file.
import codecs
completeName = os.path.join(save_path, file_name)
file_object = codecs.open(completeName, "w", "utf-8")
html = browser.page_source
file_object.write(html)
If you really need to save the entire website you should look into using a tool like AutoIT. This will make it possible to interact with the save dialog.
You cannot interact with system dialogs like save file dialog.
If you want to save the page html you can do something like this:
page = driver.page_source
file_ = open('page.html', 'w')
file_.write(page)
file_.close()
This is a complete, working example of the answer RemcoW provided:
You first have to install a webdriver, e.g. pip install selenium chromedriver_installer.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# core modules
import codecs
import os
# 3rd party modules
from selenium import webdriver
def get_browser():
"""Get the browser (a "driver")."""
# find the path with 'which chromedriver'
path_to_chromedriver = ('/usr/local/bin/chromedriver')
browser = webdriver.Chrome(executable_path=path_to_chromedriver)
return browser
save_path = os.path.expanduser('~')
file_name = 'index.html'
browser = get_browser()
url = "https://martin-thoma.com/"
browser.get(url)
complete_name = os.path.join(save_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
html = browser.page_source
file_object.write(html)
browser.close()