I am doing automation on the website of the company I work for.
I have run into a difficulty; I think it may be simple to solve, but I couldn't come up with a solution.
Here is a code example:
from selenium import webdriver
from time import sleep
import csv
URL = 'XXXXX'
URL2 = 'XXXXX'
user = 'XXXXX'
password = 'XXXXX'
filename = './geradores.csv'
def Autonomation():
    driver = webdriver.Ie()
    driver.get(URL)
    driver.find_element_by_name('Login').send_keys(user)
    driver.find_element_by_name('password').send_keys(password)
    sleep(5)
    driver.execute_script("window.open()")
    driver.switch_to.window(driver.window_handles[1])
    driver.get(URL2)
    driver.maximize_window()
    with open(filename, 'r') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=';')
        for linha in reader:
            folder = linha['Pasta']
            rsc = linha['Rascunho']
            driver.find_element_by_link_text('Geradores').click()
            sleep(5)
            driver.find_element_by_name('gErador').send_keys(folder)
            driver.find_element_by_name('bloco').send_keys(rsc)
            driver.find_element_by_id('salva').click()
            driver.find_element_by_link_text('Começo').click()
if __name__ == '__main__':
    while True:  # this part causes the code to reload
        try:
            Autonomation()
        except:
            driver.quit()
            Autonomation()
The problem I face is that when the code is reloaded automatically, it reads the CSV from the first line again and tries to save a folder that has already been saved.
What I need: when the code is automatically reloaded, it should start reading from the same line where it stopped.
Example: if the code is reading line 200 when the page timeout is reached and it reloads automatically, it should pick up where it left off, at line 200.
Number of rows in the CSV: 5000K.
Timeout: 40 min.
I even thought of reading the CSV in a separate script and importing it into the automation as a module.
From what I understand:
Your code is doing "data entry" work: you are reading CSV data and using it to fill in a web page form.
The problem is that when the page refreshes, the code starts reading the data from the beginning again instead of resuming where it stopped.
Try reading the data in one pass, and write all the completed records to a file.
Option 1:
The simple option: split the large CSV file into smaller files. Say you can update about 550 records before the page goes for a refresh; then put 500 records in each file, update those 500, and wait until the page has refreshed before starting the next file.
Option 2:
Is there actually a way to check whether the page is about to refresh? If so, do this:
Keep a counter of how many records have been updated. When the page is about to refresh, save this counter to a temp file.
Then update your code to check whether the temp file is present and, if so, read the counter from it, skip that number of records, and continue the work.
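A rough sketch of that checkpoint idea, assuming a hypothetical helper file progress.tmp that stores the number of rows already saved; the CSV file name, delimiter, and column names are taken from the question's script:

import csv
import os

PROGRESS_FILE = './progress.tmp'  # hypothetical checkpoint file
filename = './geradores.csv'

def load_progress():
    # Return how many CSV rows were already processed before the last reload
    if os.path.exists(PROGRESS_FILE):
        with open(PROGRESS_FILE) as f:
            return int(f.read().strip() or 0)
    return 0

def save_progress(count):
    # Persist the number of rows processed so far
    with open(PROGRESS_FILE, 'w') as f:
        f.write(str(count))

done = load_progress()
with open(filename) as f:
    reader = csv.DictReader(f, delimiter=';')
    for row_number, linha in enumerate(reader):
        if row_number < done:
            continue  # already saved before the last reload, skip it
        # ... fill the form with linha['Pasta'] and linha['Rascunho'] here ...
        save_progress(row_number + 1)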
Related
I currently use Selenium to click a download link to download a CSV file into directory, wait for the download to complete, then read the file from directory into a python variable.
I want to deploy the script into a Docker container, and as far as I'm aware I can't load the CSV the way I currently do.
I also don't know how the CSV file is created by the download button so I can't call webdriver.Firefox.execute_script().
Is there a way to intercept Firefox when downloading to see the file and save it straight to a variable at that point?
Is there an easy way to see how the CSV is created? (as I can't read the website code)
If no to the above, is there a way I can perform my current actions inside a Docker container that will be hosted on a cloud server?
My current code for reference
# Imports needed by the snippet below
import csv
import os
import sys
import time
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# Driver set-up
firefox_driver_path = "Path to geckodriver.exe"
downloaded_csv_path = "Path to specific .csv file when downloaded"
firefox_options = webdriver.FirefoxOptions()
firefox_options.headless = True
firefox_options.set_preference("browser.download.dir", "Path to download .csv file")
firefox_options.set_preference("browser.download.folderList", 2)
firefox_options.set_preference("browser.download.useDownloadDir", True)
firefox_options.set_preference("browser.download.viewableInternally.enabledTypes", "")
firefox_options.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/pdf;text/plain;application/text;text/xml;text/csv;application/xml")
driver = webdriver.Firefox(executable_path=firefox_driver_path, options=firefox_options)
wait = WebDriverWait(driver, 10)

# Navigate to webpage, perform actions
driver.get("website URL")
# Perform other actions to navigate to the download button
driver.find_element_by_id('DownloadButton').click()

# Check if download is complete
time_to_wait = 10
time_counter = 0
while not os.path.exists(downloaded_csv_path):
    time.sleep(1)
    time_counter += 1
    if time_counter > time_to_wait:
        sys.exit("First CSV download didn't complete")

# Read the CSV data and delete the downloaded file
with open(downloaded_csv_path) as csvFile:
    my_CSV = list(csv.reader(csvFile, delimiter=','))  # materialise the rows before the file is deleted
os.unlink(downloaded_csv_path)
UPDATE
One download button has what seems to be an Angular function attached to it: ng-click="$ctrl.onExportTransactions()".
I know nothing of Angular, and after searching every source file of the website (including three named "Angular xxx"), I can't find a function called "onExportTransactions". Is there a way this function can be invoked from Python/Selenium?
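One thing that may be worth trying, offered only as a hedged sketch: rather than calling the Angular function itself, locate the element that carries the ng-click attribute and fire its click through JavaScript, so that Angular runs its own handler. The XPath below is an assumption built solely from the attribute quoted above.

# Hedged sketch: trigger the export button's own click handler via JavaScript.
# The XPath is an assumption built from the ng-click attribute quoted above.
export_button = driver.find_element_by_xpath(
    '//*[@ng-click="$ctrl.onExportTransactions()"]')
driver.execute_script("arguments[0].click();", export_button)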
I'm trying to create a script that:
Open browser
-> Go to a website (login page)
-> Log in automatically (filling in the email and password details from a CSV file)
-> Close the tab
-> Reopen the website
-> Log in automatically again, but with the second account (filling in details from the SECOND ROW of the CSV file).
...
Repeat the same tasks 50 times (from account 1 to 50, for example).
import pandas as pd
from selenium import webdriver

# Open browser and go to the Facebook login page
browser = webdriver.Chrome(r'C:\Users\Hamza\Desktop\Python\chromedriver')
browser.get('https://facebook.com')

# Import the data file
data = pd.read_excel(r'C:\Users\Hamza\Desktop\testcsv.xlsx')
I went on the Facebook website, pulled up the page source, and quickly wrote a little something extra to log you in to the site:
import pandas as pd
from selenium import webdriver

# Open browser and go to the Facebook login page
browser = webdriver.Chrome(r'C:\Users\Hamza\Desktop\Python\chromedriver')
browser.get('https://facebook.com')

# Import the data file
data = pd.read_excel(r'C:\Users\Hamza\Desktop\testcsv.xlsx')

Username = data.username
Password = data.password

a = 0
while True:
    # Send username
    id_box = browser.find_element_by_id('email')
    id_box.send_keys(Username[a])
    # Send password
    Pass_box = browser.find_element_by_class_name('pass')
    Pass_box.send_keys(Password[a])
    # Click login
    Send = browser.find_element_by_css_selector('#u_0_3')
    Send.click()
    try:
        # If the password field is still present, the login did not go through:
        # clear both fields and move on to the next account
        test = browser.find_element_by_class_name('pass')
        id_box.clear()
        Pass_box.clear()
    except:
        print("logged in")
        break
    a = a + 1
However, this assumes that your file has the data saved in columns named username and password, so you might have to tweak it.
In my opinion, the best way to do this is to read the CSV file with a DictReader.
Here's my code; it may help you. Ignore the field names, they just come from the file I'm working on.
import csv

with open(r'C:\Users\Hamza\Desktop\testcsv.csv', 'rt') as f:
    data = csv.DictReader(f)
    for detail in data:
        numberOfBedrooms = detail['numberOfBedrooms']
        numberOfBathrooms = detail['numberOfBathrooms']
        pricePerMonth = detail['pricePerMonth']
        adress = detail['adress']
        description = detail['description']
        square_feet = detail['square_feet']
        bedrooms = driver.find_element_by_xpath(
            '//*[@id="jsc_c_12" or text()="Number of bathrooms"]')
        bedrooms.send_keys(numberOfBathrooms)
Loop through your data, store the values you want in variables, then use those variables with send_keys, just like I did in the example: bedrooms.send_keys(numberOfBathrooms).
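Tying this back to the original question, a minimal sketch of the "one login per CSV row" flow might look like the code below. It assumes the CSV has username and password columns; the locators 'email', 'pass', and 'login' are assumptions about the login page.

import csv
from selenium import webdriver

# Minimal sketch: one fresh browser session per account in the CSV.
with open(r'C:\Users\Hamza\Desktop\testcsv.csv') as f:
    for row in csv.DictReader(f):
        browser = webdriver.Chrome(r'C:\Users\Hamza\Desktop\Python\chromedriver')
        browser.get('https://facebook.com')
        browser.find_element_by_name('email').send_keys(row['username'])
        browser.find_element_by_name('pass').send_keys(row['password'])
        browser.find_element_by_name('login').click()  # 'login' button name is an assumption
        # ... do whatever is needed while logged in ...
        browser.quit()  # close the browser, then start over with the next account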
I am trying to write a script (Python 2.7.11, Windows 10) to collect data from an API and append it to a csv file.
The API I want to use returns data in json.
It limits the # of displayed records though, and pages them.
So there is a max number of records you can get with a single query, and then you have to run another query, changing the page number.
The API tells you how many pages a dataset is divided into.
Let's assume that the maximum number of records per page is 100 and the number of pages is 2.
My script:
import json
import urllib2
import csv

url = "https://some_api_address?page="
limit = "&limit=100"
myfile = open('C:\Python27\myscripts\somefile.csv', 'ab')

def api_iterate():
    for i in xrange(1, 2, 1):
        parse_url = url,(i),limit
        json_page = urllib2.urlopen(parse_url)
        data = json.load(json_page)
        for item in data['someobject']:
            print item ['some_item1'], ['some_item2'], ['some_item3']
    f = csv.writer(myfile)
    for row in data:
        f.writerow([str(row)])
This does not seem to work, i.e. it creates a csv file, but the file is not populated. There is obviously something wrong with either the part of the script which builds the address for the query OR the part dealing with reading json OR the part dealing with writing query to csv. Or all of them.
I have tried using other resources and tutorials, but at some point I got stuck and I would appreciate your assistance.
The url you have given provides a link to the next page as one of the objects. You can use this to iterate automatically over all of the pages.
The script below gets each page, extracts two of the entries from the Dataobject array and writes them to an output.csv file:
import json
import urllib2
import csv

def api_iterate(myfile):
    url = "https://api-v3.mojepanstwo.pl/dane/krs_osoby"
    csv_myfile = csv.writer(myfile)
    cols = ['id', 'url']
    csv_myfile.writerow(cols)  # Write a header
    while True:
        print url
        json_page = urllib2.urlopen(url)
        data = json.load(json_page)
        json_page.close()
        for data_object in data['Dataobject']:
            csv_myfile.writerow([data_object[col] for col in cols])
        try:
            url = data['Links']['next']  # Get the next url
        except KeyError as e:
            break

with open(r'e:\python temp\output.csv', 'wb') as myfile:
    api_iterate(myfile)
This will give you an output file looking something like:
id,url
1347854,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1347854
1296239,https://api-v3.mojepanstwo.pl/dane/krs_osoby/1296239
705217,https://api-v3.mojepanstwo.pl/dane/krs_osoby/705217
802970,https://api-v3.mojepanstwo.pl/dane/krs_osoby/802970
In the code below, "list.py" will read target_list.txt and create a list of domains in the form "http://testsites.com".
Only once that process is complete do I know that the target list is finished and that my other function should run. How do I sequence them properly?
#!/usr/bin/python
import Queue

targetsite = "target_list.txt"

def domaincreate(targetsitelist):
    for i in targetsite.readlines():
        i = i.strip()
        Url = "http://" + i
        DomainList = open("LiveSite.txt", "rb")
        DomainList.write(Url)
        DomainList.close()

def SiteBrowser():
    TargetSite = "LiveSite.txt"
    Tar = open(TargetSite, "rb")
    for Links in Tar.readlines():
        Links = Links.strip()
        UrlSites = "http://www." + Links
        browser = webdriver.Firefox()
        browser.get(UrlSites)
        browser.save_screenshot(Links+".png")
        browser.quit()

domaincreate(targetsite)
SiteBrowser()
I suspect that, whatever problem you have, a large part of it is that you are trying to write to a file that you have opened read-only. If you're running on Windows, you may also run into trouble later because you are opening in binary mode but writing a text file (under a UNIX-based system there's no problem).
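To illustrate that point, a minimal corrected sketch of domaincreate() (keeping the file names from the question) could open the target list for reading and the output file for writing, both in text mode:

def domaincreate(targetsitefile):
    # Read the plain-text target list and write one URL per line to LiveSite.txt,
    # opening the output for writing ("w") instead of read-only binary ("rb").
    with open(targetsitefile, "r") as sites, open("LiveSite.txt", "w") as domainlist:
        for line in sites:
            line = line.strip()
            if line:
                domainlist.write("http://" + line + "\n")

domaincreate("target_list.txt")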
So recently I have taken on the task of downloading a large collection of files from the NCBI database. However, I have run into times where I have to create multiple databases. This code works to download all the viruses from the NCBI website. My question is: is there any way to speed up downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multithreading but could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (I'm new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I get it sometimes with certain combinations of retstart and retmax. It crashes the program, and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup

#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0, Count, retmax):
    #Create the position string
    position = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + position + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool  # use threads
from urllib2 import urlopen

def generate_urls(some, params):  # XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else:  # success
        return (url, filename), None

def main():
    pool = Pool(20)  # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
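For the question's NCBI case, generate_urls() might be filled in roughly as follows. This is only a sketch: it reuses the base, efetch, options and retmax pieces built in the question's script, and the per-chunk file naming is a made-up convention.

def generate_urls(count, retmax):
    # Build one efetch URL per retstart chunk, reusing the URL pieces from the
    # question's script; writing each chunk to its own file is an assumption.
    for retstart in xrange(0, count, retmax):
        position = '&retstart=' + str(retstart) + '&retmax=' + str(retmax)
        url = base + efetch + position + options
        filename = 'chunk_%d.fasta' % retstart
        yield url, filename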
You should use multithreading; it's the right approach for download tasks.
As for "these files take more than 10 seconds to download and I do not know how to handle stalling":
I don't think this will be a problem, because Python's multithreading handles it; in fact, multithreading is meant for exactly this kind of I/O-bound work. While one thread is waiting for a download to complete, the CPU lets the other threads do their work.
Anyway, you should at least try it and see what happens.
There are two other ways to tackle your task: 1. use processes instead of threads, in which case multiprocessing is the module you should use; 2. go event-based, in which case gevent is the right module.
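For the event-based route, a minimal gevent sketch might look like this. It is an illustration only: download() is a placeholder for your per-URL download logic, and urls stands for whatever list of efetch URLs you generate.

from gevent import monkey
monkey.patch_all()  # patch the standard library so urllib2 cooperates with gevent

import urllib2
from gevent.pool import Pool

def download(url):
    # Placeholder: replace with your own download-and-save logic
    return urllib2.urlopen(url).read()

pool = Pool(20)  # at most 20 concurrent greenlets
results = pool.map(download, urls)  # 'urls' is assumed to be your list of efetch URLs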
The 502 error is not your script's fault. Put simply, the following pattern can be used to retry:
try_count = 3
while try_count > 0:
    try:
        download_task()
    except urllib2.HTTPError:
        clean_environment_for_retry()
        try_count -= 1
    else:
        break  # the download succeeded, so stop retrying
Inside the except clause, you can check the concrete HTTP status code and handle particular errors differently.
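For example, inside the while loop above (a sketch reusing the hypothetical download_task() and clean_environment_for_retry() names), you might retry only on a 502 and re-raise anything else:

    try:
        download_task()
    except urllib2.HTTPError as e:
        if e.code == 502:
            # Transient "Bad Gateway": clean up and let the surrounding loop retry
            clean_environment_for_retry()
            try_count -= 1
        else:
            # Any other HTTP error is unexpected here, so surface it
            raise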