I am trying to open a .csv file, follow each link it contains with Selenium, and loop through all the links in the file. I am new to Selenium; I can do this easily with Beautiful Soup. Can you please point me in the right direction?
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import csv
import requests

contents = []
filename = 'link_business_filter.csv'

def copy_json():
    with open('vendors_info_bangkok.json', "a") as wt:
        for x in script3:
            wt.write(x)
    return

with open(filename, 'rt') as f:
    data = csv.reader(f)
    for row in data:
        links = row[0]
        contents.append(links)

for link in contents:
    url_html = requests.get(link)
    browser = webdriver.Chrome('chromedriver')
    for link_loop in url_html:
        open = browser.get(link_loop)
        source = browser.page_source
        data = bs(source, "html.parser")
        body = data.find('body')
        script = body
        x_path = '//*[@id="react-root"]/section/main/div'
        script2 = browser.find_element_by_xpath(x_path)
        script3 = script2.text
        print(script3)
        copy_json()
First install Selenium:
pip install selenium
Then, according to your OS, install chromedriver and test it: go to the folder where you keep the driver, open a terminal, and type chromedriver. If there is no error, it works.
Then in your code you need to provide the executable_path for the chromedriver.
In your code:
....code...
for link in contents:
    url_html = requests.get(link)
    path_to_chromedriver = 'C:/Users/chromedriver.exe'  # <-- you can keep this file anywhere you wish
    browser = webdriver.Chrome(executable_path=path_to_chromedriver)  # <-- you can also give the path directly here
    for link_loop in url_html:
        ...code...
I would like to download data from the http://ec.europa.eu/taxation_customs/vies/ site. The problem is that when I enter data on the site through my program, the URL doesn't change, so the file saved to disk shows the same page that was opened at the beginning, without the data. Maybe I don't know how to access the site after submitting the data? I'm new to Python and tried to look for a solution, but without result, so if there was such an issue, please link me to it. Here's my code. I appreciate all responses. :)
import requests
import selenium
import select as something
from selenium import webdriver
from selenium.webdriver.support.ui import Select
import pdfkit
url = "http://ec.europa.eu/taxation_customs/vies/?locale=pl"
driver = webdriver.Chrome(executable_path ="C:\\Users\\Python\\Chromedriver.exe")
driver.get("http://ec.europa.eu/taxation_customs/vies/")
#wait = WebDriverWait(driver, 10)
obj = Select(driver.find_element_by_id("countryCombobox"))
obj = obj.select_by_index(1)
vies_r = requests.get(url)
vies_vat = driver.find_element_by_id("number")
vies_vat.send_keys('U54799909')
vies_verify = driver.find_element_by_id("submit")
vies_verify.click()
path_wkhtmltopdf = r'C:\Users\Python\wkhtmltox\wkhtmltox\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=path_wkhtmltopdf)
print(driver.current_url)
pdfkit.from_url(driver.current_url, "out.pdf", configuration=config)
I am having trouble downloading a txt file from this page: https://www.ceps.cz/en/all-data#RegulationEnergy (when you scroll down you see Download: txt, xls and xml).
My goal is to create a scraper that goes to the linked page, clicks on the txt link, for example, and saves the downloaded file.
Main problems that I am not sure how to solve:
The file doesn't have a real link that I can call to download it; the link is created with JS based on the filters and file type.
When I use the requests library for Python and call the link with all headers, it just redirects me to https://www.ceps.cz/en/all-data .
Approaches tried:
Using a scraper such as ParseHub to download the link didn't work as intended, but this scraper was the closest to what I wanted to get.
Used the requests library to connect to the link using the headers that the XHR request uses for downloading the file, but it just redirects me to https://www.ceps.cz/en/all-data .
If you could propose some solution for this task, thank you in advance. :-)
You can download this data to a directory of your choice with Selenium; you just need to specify the directory to which the data will be saved. In what follows below, I'll save the txt data to my desktop:
from selenium import webdriver
download_dir = '/Users/doug/Desktop/'
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory' : download_dir}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('https://www.ceps.cz/en/all-data')
container = driver.find_element_by_class_name('download-graph-data')
button = container.find_element_by_tag_name('li')
button.click()
You could do it like so:
import requests

txt_format = 'txt'
xls_format = 'xls'  # open in binary mode
xml_format = 'xml'  # open in binary mode

def download(file_type):
    url = f'https://www.ceps.cz/download-data/?format={file_type}'
    response = requests.get(url)
    if file_type == txt_format:
        with open(f'file.{file_type}', 'w') as file:
            file.write(response.text)
    else:
        with open(f'file.{file_type}', 'wb') as file:
            file.write(response.content)

download(txt_format)
I have a Selenium script in Python (using ChromeDriver on Windows) that fetches the download links of various attachments (of different file types) from a page and then opens these links to download the attachments. This works fine for the file types that ChromeDriver can't preview, as they get downloaded by default. But images (JPEG, PNG) and PDFs are previewed by default and hence aren't automatically downloaded.
The ChromeDriver options I am currently using (work for non preview-able files) :
chrome_options = webdriver.ChromeOptions()
prefs = {'download.default_directory' : 'custom_download_dir'}
chrome_options.add_experimental_option('prefs', prefs)
driver = webdriver.Chrome("./chromedriver.exe", chrome_options=chrome_options)
This downloads the files to 'custom_download_dir', no issues. But the preview-able files are just previewed in the ChromeDriver instance and not downloaded.
Are there any ChromeDriver Settings that can disable this preview behavior and directly download all files irrespective of the extensions?
If not, can this be done using Firefox for instance?
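For PDFs specifically, Chrome exposes a preference that skips the built-in viewer. The snippet below is a sketch of the prefs involved (the download directory is a placeholder, and plugins.always_open_pdf_externally only affects PDFs, not images):

```python
# Preferences that make ChromeDriver download rather than preview PDFs
prefs = {
    'download.default_directory': 'custom_download_dir',  # placeholder path
    'download.prompt_for_download': False,                # never show the save dialog
    'plugins.always_open_pdf_externally': True,           # bypass Chrome's PDF viewer
}

def make_driver():
    # Imported here so the prefs dict above can be inspected without Selenium installed
    from selenium import webdriver
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_experimental_option('prefs', prefs)
    return webdriver.Chrome('./chromedriver.exe', chrome_options=chrome_options)
```

Images are still previewed even with these prefs, which is why the answers here fall back to fetching the image URL directly.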
Instead of relying on specific browser/driver options, I would implement a more generic solution that uses the image URL to perform the download.
You can get the image URL using similar code:
driver.find_element_by_id("your-image-id").get_attribute("src")
And then I would download the image using, for example, urllib.
Here's some pseudo-code for Python2:
import urllib
url = driver.find_element_by_id("your-image-id").get_attribute("src")
urllib.urlretrieve(url, "local-filename.jpg")
Here's the same for Python3:
import urllib.request
url = driver.find_element_by_id("your-image-id").get_attribute("src")
urllib.request.urlretrieve(url, "local-filename.jpg")
Edit after the comment: just another example of how to download a file once you know its URL:
import requests
from PIL import Image
from io import BytesIO  # image bytes need BytesIO, not StringIO

image_name = 'image.jpg'
url = 'http://example.com/image.jpg'

r = requests.get(url)
i = Image.open(BytesIO(r.content))
i.save(image_name)
With the selenium-wire library, it is possible to download images via ChromeDriver.
I have defined the following function to parse each request and save the request body to a file when necessary.
import os
from mimetypes import guess_extension

from seleniumwire import webdriver

def download_assets(requests, asset_dir="temp", default_fname="untitled",
                    exts=[".png", ".jpeg", ".jpg", ".svg", ".gif", ".pdf", ".ico"]):
    asset_list = {}
    for req_idx, request in enumerate(requests):
        # request.headers
        # request.response.body is the raw response body in bytes
        ext = guess_extension(request.response.headers['Content-Type'].split(';')[0].strip())
        if ext is None or ext not in exts:
            # Unknown file extension, or not in the whitelist
            continue
        # Construct a filename
        fname = os.path.basename(request.url.split('?')[0])
        fname = "".join(x for x in fname if (x.isalnum() or x in "._- "))
        if fname == "":
            fname = f"{default_fname}_{req_idx}"
        if not fname.endswith(ext):
            fname = f"{fname}{ext}"
        fpath = os.path.join(asset_dir, fname)
        # Save the file
        print(f"{request.url} -> {fpath}")
        asset_list[fpath] = request.url
        with open(fpath, "wb") as file:
            file.write(request.response.body)
    return asset_list
Let's download some images from Google homepage to temp folder.
# Create a new instance of the Chrome/Firefox driver
driver = webdriver.Chrome()
# Go to the Google home page
driver.get('https://www.google.com')
# Download content to temp folder
asset_dir = "temp"
os.makedirs(asset_dir, exist_ok=True)
download_assets(driver.requests, asset_dir=asset_dir)
driver.close()
Note that the function can be improved such that the directory structure can be kept as well.
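As a sketch of that improvement, assuming we mirror the URL's host and path under the asset directory (local_path_for is a hypothetical helper, not part of selenium-wire):

```python
import os
from urllib.parse import urlparse

def local_path_for(url, asset_dir="temp"):
    """Map a URL to a local path that mirrors the URL's host and path."""
    parsed = urlparse(url)
    # Strip the leading '/' so os.path.join keeps asset_dir as the root;
    # fall back to 'index' for bare host URLs
    rel = parsed.path.lstrip('/') or 'index'
    return os.path.join(asset_dir, parsed.netloc, *rel.split('/'))
```

Inside download_assets, computing fpath = local_path_for(request.url, asset_dir) and calling os.makedirs(os.path.dirname(fpath), exist_ok=True) before writing would keep the site's layout.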
Here is another simple way, but @Pitto's answer above is slightly more succinct.
import requests
from selenium.webdriver.common.by import By

webelement_img = ff.find_element(By.XPATH, '//img')  # ff is your webdriver instance
url = webelement_img.get_attribute('src') or 'https://someimages.com/path-to-image.jpg'
data = requests.get(url).content

local_filename = 'filename_on_your_computer.jpg'
with open(local_filename, 'wb') as f:
    f.write(data)
I am trying to use Selenium in Python to save webpages on MacOS Firefox.
So far, I have managed to click COMMAND + S to pop up the SAVE AS window. However,
I don't know how to:
- change the directory of the file,
- change the name of the file, and
- click the SAVE AS button.
Could someone help?
Below is the code I have used to click COMMAND + S:
ActionChains(browser).key_down(Keys.COMMAND).send_keys("s").key_up(Keys.COMMAND).perform()
Besides, the reason for me to use this method is that I encounter a UnicodeEncodeError when I:
write the page_source to a html file and
store scrapped information to a csv file.
Write to a html file:
file_object = open(completeName, "w")
html = browser.page_source
file_object.write(html)
file_object.close()
Write to a csv file:
csv_file_write.writerow(to_write)
Error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf8' in
position 1: ordinal not in range(128)
with open('page.html', 'w', encoding='utf-8') as f:
    f.write(driver.page_source)
What you are trying to achieve is impossible to do with Selenium; the dialog that opens is not something Selenium can interact with.
The closest thing you could do is collect the page_source, which gives you the entire HTML of a single page, and save this to a file.
import codecs
completeName = os.path.join(save_path, file_name)
file_object = codecs.open(completeName, "w", "utf-8")
html = browser.page_source
file_object.write(html)
file_object.close()
If you really need to save the entire website you should look into using a tool like AutoIT. This will make it possible to interact with the save dialog.
You cannot interact with system dialogs like save file dialog.
If you want to save the page html you can do something like this:
page = driver.page_source
file_ = open('page.html', 'w', encoding='utf-8')
file_.write(page)
file_.close()
This is a complete, working example of the answer RemcoW provided:
You first have to install a webdriver, e.g. pip install selenium chromedriver_installer.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# core modules
import codecs
import os
# 3rd party modules
from selenium import webdriver
def get_browser():
    """Get the browser (a "driver")."""
    # find the path with 'which chromedriver'
    path_to_chromedriver = '/usr/local/bin/chromedriver'
    browser = webdriver.Chrome(executable_path=path_to_chromedriver)
    return browser
save_path = os.path.expanduser('~')
file_name = 'index.html'
browser = get_browser()
url = "https://martin-thoma.com/"
browser.get(url)
complete_name = os.path.join(save_path, file_name)
file_object = codecs.open(complete_name, "w", "utf-8")
html = browser.page_source
file_object.write(html)
browser.close()
This is the first time, I am posting a question. Please forgive me if I do something incorrect.
I am trying to create a python-selenium script to get the source code of MULTIPLE web pages.
I am running the script in the following manner (via command line on windows 7)
python program.py < input.txt > output.htm
This does create the result; however, since I am using a loop, it keeps appending all the results to the same file.
Is there a way I can create a NEW FILE FOR EACH result/print?
Thanks in advance.
Here is my code,
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time
from selenium.webdriver.common.action_chains import ActionChains
path_to_chromedriver = '/Users/office/Desktop/chromedriver' # change path as needed
browser = webdriver.Chrome(executable_path = path_to_chromedriver)
while True:
    url = raw_input("")
    url2 = raw_input("")
    browser.get(url)
    time.sleep(10)
    browser.get(url2)
    time.sleep(10)
    element_to_hover_over = browser.find_element_by_xpath('//*[@id="personSummaryTable"]/tbody/tr/td[2]/div[5]/div/span[1]/a')
    hover = ActionChains(browser).move_to_element(element_to_hover_over)
    hover.perform()
    time.sleep(5)
    stuff = browser.page_source.encode('ascii', 'ignore')
    print stuff
Jan's idea worked great.
All it needed was to let Python decide the filename.
Thanks, Jan!
import datetime

suffix = ".html"
basename = datetime.datetime.now().strftime("%y%m%d_%H%M%S")
fname = basename + suffix  # e.g. '120508_171442.html'
print fname
with open(fname, "w") as f:
    f.write(stuff)
Welcome to SO!
You have two options:
Let the Python code decide on the name of the output file.
This will likely be based on the current time, e.g.
import time

# here get somehow your page content
page_content = ?????

prefix = "output-"
suffix = ".html"
fname = "{prefix}{now:d}{suffix}".format(prefix=prefix, now=int(time.time()), suffix=suffix)
print fname
with open(fname, "w") as f:
    f.write(page_content)
Let your external loop (outside of Python) create the file name.
This file name can be created, e.g. on Linux, by some form of the date command.