Python Import Statement bug?

**I solved this below**: I think it may be helpful to others in the future, so I'm keeping my question up rather than taking it down. It's a Python-vs.-other-languages nested file import issue. However, if anyone understands the intricacies of why Python behaves this way, an explanatory answer would be greatly appreciated.
I had my code running fine with a file directory setup like this:
sniffer //folder
-__init__.py
-Sniffer.py
-database.py
I switched it to:
Main
-snifferLaunch.py
-flashy
--sniffer
---Sniffer.py
---database.py
In theory, if I change the imports to point at the new folders, it should still run the same way...
I was under the impression that a Python file could be imported even if it was nested. For example:
import Sniffer  # in snifferLaunch; I expected this to search each folder and find a Sniffer.py file.
However, I found this to be false. Did I misunderstand it? So I looked at an example which imports files like this:
import flashy.sniffer.Sniffer as Sniffer
This does import the file, I believe. When I run it, however, it prints this traceback on launch:
Traceback (most recent call last):
File "snifferLaunch.py", line 19, in <module>
import flashy.sniffer.Sniffer
File "/Users/tai/Desktop/FlashY/flashy/sniffer/__init__.py", line 110, in <module>
File "/Users/tai/Desktop/FlashY/flashy/sniffer/__init__.py", line 107, in forInFile
File "/Users/tai/Desktop/FlashY/flashy/sniffer/__init__.py", line 98, in runFlashY
File "/Users/tai/Desktop/FlashY/flashy/sniffer/__init__.py", line 89, in db
AttributeError: 'module' object has no attribute 'getDecompiledFiles'
This would normally cause me to go look for a getDecompiledFiles function. The problem is that nowhere in the code is there a getDecompiledFiles; there is only a get_Decompiled_Files function.
My code looks something like this (non-essential parts removed). Do you see my bug? I searched the entire project and could not find a getDecompiledFiles function anywhere, and I don't know why it expects an attribute with that name...
snifferLaunch:
import flashy.sniffer.Sniffer as Sniffer
import flashy.sniffer.database as database
import flashy.sniffer.cleaner as cleaner

def open_websites(line):
    #opens a list of websites from local file "urlIn.txt" and runs the Sniffer on them.
    #It retrieves the swfs from each url and storing them in the local out/"the modified url" /"hashed swf.swf" and the file contains the decompiled swf
    print( "opening websites")
    newSwfFiles = [];
    # reads in all of the lines in urlIn.txt
    #for line in urlsToRead:
    if line[0] !="#":
        newLine = cleaner.remove_front(line);
        # note the line[:9] is done to avoid the http// which creates an additional file to go into. The remaining part of the url is still unique.
        outFileDirectory = decSwfsFolder + "/" + newLine
        cleaner.check_or_create_dir(outFileDirectory)
        try:
            newSwfFiles = Sniffer.open_url(line, []);
        except:
            print " Sniffer.openURL failed"
            pass
        # for all of the files there it runs jpex on them. (in the future this will delete the file after jpex runs so we don't run jpex more than necessary)
        for location in newSwfFiles:
            cleaner.check_or_create_dir(outFileDirectory + "/" + location)
            #creates the command for jpex flash decompiler, the command + file to save into + location of the swf to decompile
            newCommand = javaCommand + "/" + newLine + "/" + location +"/ " + swfLoc +"/"+ location
            os.system(newCommand)
            print ("+++this is the command: " + newCommand+"\n")
            # move the swf into a new swf file for db storage
            oldLocation = swfFolder + location;
            newLocation = decSwfsFolder + "/" + newLine + "/" + location + "/" + "theSwf"+ "/"
            cleaner.check_or_create_dir(newLocation)
            if(os.path.exists(oldLocation)):
                # if the file already exists at that location do not move it, simply delete the duplicate
                if(os.path.exists(newLocation +"/"+ location)):
                    os.remove(oldLocation)
                else:
                    shutil.move(swfFolder + location, newLocation)
        if cleanup:
            cleaner.cleanSwf();
        # newSwfFiles has the directory file location of each new added file: "directory/fileHash.swf"

def db():
    database.get_decompiled_files()

def run_flashY(line):
    #Run FlashY a program that decompiles all of the swfs found at urls defined in urlIn.txt.
    #Each decompiled file will be stored in the PaperG Amazon S3 bucket: decompiled_swfs.
    #run the program for each line
    #open all of the websites in the url file urlIn.txt
    open_websites(line)
    #store the decompiled swfs in the database
    db()
    #remove all files from local storage
    cleaner.clean_out()
    #kill all instances of firefox

def for_in_file():
    #run sniffer for each line in the file
    #for each url, run then kill firefox to prevent firefox buildup
    for line in urlsToRead:
        run_flashY(line)
        cleaner.kill_firefox()

#Main Functionality
if __name__ == '__main__':
    #initialize and run the program on launch
    for_in_file()
The Sniffer File:
import urllib2
from urllib2 import Request, urlopen, URLError, HTTPError
import shutil
import sys
import re
import os
import hashlib
import time
import datetime
from selenium import webdriver
import glob
import thread
import httplib
from collections import defaultdict
import cleaner
a=[];
b=[];
newSwfFiles=[];
theURL='';
curPath = os.path.dirname(os.path.realpath(__file__))
#firebug gets all network data
fireBugPath = curPath +'/firebug-1.12.8b1.xpi';
#netExport exports firebug's http archive (network req/res) in the form of a har file
netExportPath = curPath +'/netExport.xpi';
harLoc = curPath +"/har/";
swfLoc = curPath +"/swfs";
cleanThis=True
#remove har file(s) after reading them out to gather swf files
profile = webdriver.firefox.firefox_profile.FirefoxProfile();
profile.add_extension( fireBugPath);
profile.add_extension(netExportPath);
hashLib = hashlib.md5()
#firefox preferences
profile.set_preference("app.update.enabled", False)
profile.native_events_enabled = True
profile.set_preference("webdriver.log.file", curPath +"webFile.txt")
profile.set_preference("extensions.firebug.DBG_STARTER", True);
profile.set_preference("extensions.firebug.currentVersion", "1.12.8");
profile.set_preference("extensions.firebug.addonBarOpened", True);
profile.set_preference('extensions.firebug.consoles.enableSite', True)
profile.set_preference("extensions.firebug.console.enableSites", True);
profile.set_preference("extensions.firebug.script.enableSites", True);
profile.set_preference("extensions.firebug.net.enableSites", True);
profile.set_preference("extensions.firebug.previousPlacement", 1);
profile.set_preference("extensions.firebug.allPagesActivation", "on");
profile.set_preference("extensions.firebug.onByDefault", True);
profile.set_preference("extensions.firebug.defaultPanelName", "net");
#set net export preferences
profile.set_preference("extensions.firebug.netexport.alwaysEnableAutoExport", True);
profile.set_preference("extensions.firebug.netexport.autoExportToFile", True);
profile.set_preference("extensions.firebug.netexport.saveFiles", True);
profile.set_preference("extensions.firebug.netexport.autoExportToServer", False);
profile.set_preference("extensions.firebug.netexport.Automation", True);
profile.set_preference("extensions.firebug.netexport.showPreview", False);
profile.set_preference("extensions.firebug.netexport.pageLoadedTimeout", 15000);
profile.set_preference("extensions.firebug.netexport.timeout", 10000);
profile.set_preference("extensions.firebug.netexport.defaultLogDir",harLoc);
profile.update_preferences();
browser = webdriver.Firefox(firefox_profile=profile);
def open_url(url,s):
    #open each url, find all of the har files with them and get those files.
    theURL = url;
    time.sleep(6);
    #browser = webdriver.Chrome();
    browser.get(url); #load the url in firefox
    browser.set_page_load_timeout(30)
    time.sleep(3); #wait for the page to load
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/5);")
    time.sleep(1); #wait for the page to load
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/4);")
    time.sleep(1); #wait for the page to load
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/3);")
    time.sleep(1); #wait for the page to load
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
    time.sleep(1); #wait for the page to load
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    searchText='';
    time.sleep(20); #wait for the page to load
    # print(browser.page_source);
    #close the browser and get all the swfs from the created har file.
    #uses the a & b arrays to find the swf files from generated har files
    get_swfs_from_har()
    #clean out the slashes
    clean_f_slashes()
    #get all files
    get_all_files()
    #ensure that some files were gained
    assert a != []
    assert b != []
    assert newSwfFiles != []
    #if the files (har, swf, out) should be cleaned out do so. This can be toggled for debugging
    if(cleanThis):
        cleaner.clean_har()
    return newSwfFiles;
def remove_non_url(t):
    #remove matched urls that are not actually urls
    a=[];
    for b in t:
        if(b.lower()[:4] !="http" and b.lower()[:4] != "www." ):
            if(b[:2] == "//" and b.__len__() >10):
                a.append(theURL+"/"+b[2:]);
            else:
                while((b.lower()[:4] !="http" or b.lower()[:4] !="www." or b.lower()[:1] !="//") and b.__len__() >10):
                    b=b[1:b.__len__()];
                if( b.__len__() >10):
                    if(b[:1] == "//" ):
                        if not b in a:
                            a.append(theURL+b[2:b.__len__()]);
                    else:
                        if not b in a:
                            a.append(b);
        else:
            if not b in a:
                a.append(b);
    return a;
def get_swfs_from_har():
    #validate that the files in the har are actual swf files
    files = [f for f in os.listdir(harLoc) if re.match((theURL[7:]+ '.*.har'), f)]
    for n in files:
        with open (harLoc + n , "r") as theF:
            textt = theF.read();
        swfObjects= re.findall('\{[^\{]*(?:http:\/\/|https:\/\/|www\.|\/\/)[^}]*\.swf[^}]+', textt.lower())
        #swfObjects = "".join(str(i) for i in swfObjects)
        for obj in swfObjects:
            l=[]
            otherL=[]
            links = re.findall('(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+', obj)
            for url in links:
                url=url[:url.__len__()-1]
                ending = url[url.__len__()-6:];
                if ".swf" in ending:
                    l.append(url);
                elif "." not in ending:
                    otherL.append(url);
            for c in l:
                if not c in a and c.__len__() >20:
                    a.append(c);
                    if(otherL.__len__()>0):
                        theMostLikelyLink=otherL[0];
                        b.append(theMostLikelyLink);
                        ##adds the 1st link after the swf
                        otherL.remove(theMostLikelyLink);
                    else:
                        b.append(None);
def clean_f_slashes():
    #remove unrelated characters from swfs
    for x in a:
        newS='';
        if(',' in x or ';' in x or '\\' in x):
            for d in x:
                if(d != '\\' and d != ',' and d != ';'):
                    newS+=d;
        else:
            newS=x;
        if "http" not in newS.lower():
            if "www" in newS:
                newS= "http://" + newS;
            else:
                newS = "http://www."+newS
        while(newS[:3]!="htt"):
            newS=newS[1:];
        a.remove(x);
        if(newS.__len__() >15):
            a.append(newS);
def get_all_files():
    #get all of the files from the array of valid swfs
    os.chdir(swfLoc);
    for openUrl in a:
        place = a.index(openUrl);
        try:
            req = Request(openUrl)
            response = urlopen(req)
            fData = urllib2.urlopen(openUrl)
            iText = fData.read()
            #get the hex hash of the file
            hashLib.update(iText);
            hashV =hashLib.hexdigest()+".swf";
            outUrl= get_redirected_url(b[place]);
            #check if file already exists, if it does do not add a duplicate
            theFile = [f for f in os.listdir(swfLoc) if re.match((hashV), f)]
            if hashV not in theFile:
                lFile = open(outUrl+"," +hashV, "w")
                lFile.write(iText)
                lFile.close();
        #except and then ignore any invalid urls.
        except:
            pass
    #Remove all files less than 8kb, anything less than this size is unlikely to be an advertisement. Most flash ads seen so far are 25kb or larger
    sFiles = [f for f in os.listdir(swfLoc)]
    for filenames in sFiles:
        sizeF = os.path.getsize(filenames);
        #if the file is smaller remove it
        if(sizeF<8000):
            cleaner.remove_file(filenames)
        else:
            newSwfFiles.append(filenames);
def x_str(s):
    #check if a unicode expression exists and convert it to a string
    if s is None:
        return ''
    return str(s)
def get_redirected_url(s):
    #get the url that another url will redirect to
    if s is None:
        return "";
    if ".macromedia" in s:
        return ""
    browser.get(s);
    time.sleep(20);
    theredirectedurl=cleaner.removeFront(browser.current_url);
    aUrl= re.findall("[^/]+",theredirectedurl)[0].encode('ascii','ignore')
    return aUrl;

Interesting... so I actually realized I was going about it wrong.
I still don't know why it was expecting a function that didn't exist, but I do have a guess.
I had pulled the __init__.py file to use as the snifferLaunch file. This was due to my original misunderstanding of __init__.py: I assumed it was similar to a main in other languages.
I believe the __init__.pyc file was holding an outdated version of a function. Essentially, there was a file that should never have been run; it was outdated and somehow getting called. It was the only file that contained that function, and I overlooked it because I thought it shouldn't be called.
The solution is the following, and the bug was caused by my misuse of __init__.
I changed my import statements:
from flashy.sniffer import Sniffer
import flashy.sniffer.database as database
import flashy.sniffer.cleaner as cleaner
I created new, blank __init__.py and __init__.pyc files in flashy/sniffer/.
This removed the false expectation of getDecompiledFiles and also allowed the code to run. I had been getting a "cannot find this file" error because the folder wasn't being identified as a Python package. Additional information would be appreciated if anyone can explain exactly what was going on there. I thought you could import a Python file without an __init__.py, but when it is nested in other folders it appears that the folder must be treated as a package.
My file structure looks like this now:
Main
-snifferLaunch.py //with changed import statements
-flashy
--sniffer
---Sniffer.py
---database.py
---__init__.py //blank
---__init__.pyc // blank
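For anyone hitting the same thing, here is a minimal sketch of what I believe was going on (the layout mirrors my project; the notes on import behaviour are my best understanding of Python 2's import machinery, not something I have verified in the docs). Any code placed in a package's __init__.py runs as soon as the package is imported, and an orphaned __init__.pyc left over from an older __init__.py can still be imported even after the source file is gone, so it can call functions that no longer exist in the current code. Keeping __init__.py blank and deleting stale .pyc files avoids both problems:
# Layout assumed for this sketch:
# Main/
#   snifferLaunch.py
#   flashy/
#     __init__.py            # blank; marks "flashy" as a package (Python 2 needs this file)
#     sniffer/
#       __init__.py          # blank; anything written here would run on "import flashy.sniffer..."
#       Sniffer.py
#       database.py

# snifferLaunch.py
from flashy.sniffer import Sniffer          # package-style import; a bare "import Sniffer" is not searched for in nested folders
import flashy.sniffer.database as database

if __name__ == '__main__':
    database.get_decompiled_files()          # resolves against the current source, not a stale __init__.pyc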
It appears to be a Python-vs.-other-languages issue. Has anyone else experienced this?

Related

Having trouble using requests to download images off of wiki

I am working on a project where I need to scrape images off of the web. To do this, I write the image links to a file and then download each of them to a folder with requests. At first I used Google as the scrape site, but due to several reasons I decided that Wikipedia is a much better alternative. However, after the first attempt many of the images couldn't be opened, so I tried again, this time naming the downloaded files with endings that matched the endings of the links. More images were accessible this way, but many still couldn't be opened. When I tested downloading the images myself (individually, outside of the function), they downloaded perfectly, and when I then used my function to download them, they kept downloading correctly (i.e. I could access them). I am not sure if it is important, but the image endings that I generally come across are svg.png and png. I want to know why this is occurring and what I may be able to do to prevent it. I have left some of my code below. Thank you.
Function:
def download_images(file):
    object = file[0:file.index("IMAGELINKS") - 1]
    folder_name = object + "_images"
    dir = os.path.join("math_obj_images/original_images/", folder_name)
    if not os.path.exists(dir):
        os.mkdir(dir)
    with open("math_obj_image_links/" + file, "r") as f:
        count = 1
        for line in f:
            try:
                if line[len(line) - 1] == "\n":
                    line = line[:len(line) - 1]
                if line[0] != "/":
                    last_chunk = line.split("/")[len(line.split("/")) - 1]
                    endings = last_chunk.split(".")[1:]
                    image_ending = ""
                    for ending in endings:
                        image_ending += "." + ending
                    if image_ending == "":
                        continue
                    with open("math_obj_images/original_images/" + folder_name + "/" + object + str(count) + image_ending, "wb") as f:
                        f.write(requests.get(line).content)
                    file = object + "_IMAGEENDINGS.txt"
                    path = "math_obj_image_endings/" + file
                    with open(path, "a") as f:
                        f.write(image_ending + "\n")
                    count += 1
            except:
                continue
    f.close()
Doing this outside of it worked:
with open("test" + image_ending, "wb") as f:
f.write(requests.get(line).content)
Example of image link file:
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e
https://en.wikipedia.org/wiki/Special:CentralAutoLogin/start?type=1x1
https:/static/images/footer/wikimedia-button.png
https:/static/images/footer/poweredby_mediawiki_88x31.png
If all the files are indeed in PNG format and the suffix is always .png, you could try something like this:
import requests
from pathlib import Path
u1 = "https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png"
r = requests.get(u1)
Path('u1.png').write_bytes(r.content)
My previous answer works for PNGs only.
For SVG files you need to check if the file contents start with the string "<svg" and create a file with the .svg suffix.
The code below saves the downloaded files in the "downloads" subdirectory.
import requests
from pathlib import Path

# urls are stored in a file 'urls.txt'.
with open('urls.txt') as f:
    for i, url in enumerate(f.readlines()):
        url = url.strip()  # MUST strip the line-ending char(s)!
        try:
            content = requests.get(url).content
        except:
            print('Cannot download url:', url)
            continue
        # Check if this is an SVG file
        # Note that content is bytes hence the b in b'<svg'
        if content.startswith(b'<svg'):
            ext = 'svg'
        elif url.endswith('.png'):
            ext = 'png'
        else:
            print('Cannot process contents of url:', url)
            continue  # skip urls we cannot classify
        # reuse the bytes that were already downloaded
        Path('downloads', f'url{i}.{ext}').write_bytes(content)
Contents of the urls.txt file:
(the last url is an svg)
https://upload.wikimedia.org/wikipedia/commons/thumb/6/63/Triangle.TrigArea.svg/120px-Triangle.TrigArea.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Square_%28geometry%29.svg/120px-Square_%28geometry%29.svg.png
https://upload.wikimedia.org/wikipedia/commons/thumb/3/33/Hexahedron.png/120px-Hexahedron.png
https://upload.wikimedia.org/wikipedia/commons/thumb/2/22/Hypercube.svg/110px-Hypercube.svg.png
https://wikimedia.org/api/rest_v1/media/math/render/svg/5f8ab564115bf2f7f7d12a9f873d9c6c7a50190e

Importing and naming a remote file

I have written some code to read the contents from a specific url:
import requests
import os

def read_doc(doc_ID):
    filename = doc_ID + ".txt"
    if not os.path.exists(filename):
        my_url = encode_url(doc_ID)  #this is a call to another function that would encode the url
        my_response = requests.get(my_url)
        if my_response.status_code == requests.codes.ok:
            return my_response.text
    return None
This checks whether there's a file named doc_ID.txt (where doc_ID could be any name provided). If there's no such file, it reads the contents from a specific url and returns them. What I would like to do is store those returned contents in a file called doc_ID.txt. That is, I would like to finish my function by creating a new file if it didn't exist at the beginning.
How can I do that? I tried this:
my_text = my_response.text
output = os.rename(my_text, filename)
return output
but then, the actual contents of the file would become the name of the file and I would get an error saying the filename is too long.
So the issue I think I'm seeing is that you want to put the contents of your request's response into the file, rather than naming the file with the contents. The code below should create a file with the filename you want, and insert the text from your response!
import requests
import os

def read_doc(doc_ID):
    filename = doc_ID + ".txt"
    if not os.path.exists(filename):
        my_url = encode_url(doc_ID)  #this is a call to another function that would encode the url
        my_response = requests.get(my_url)
        if my_response.status_code == requests.codes.ok:
            with open(filename, "w") as file:
                file.write(my_response.text)
            return my_response.text  # return the text itself; the file object is already closed once the with-block exits
    return None
To write the response text to the file, you can simply use a Python file object; see https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
with open(filename, "w") as file:
    file.write(my_text)

How to run python functions sequentially

In the code below, "list.py" reads target_list.txt and creates a domain list in the form "http://testsites.com".
Only when this process is completed do I know that target_list is finished and that my other function must run. How do I sequence them properly?
#!/usr/bin/python
import Queue

targetsite = "target_list.txt"

def domaincreate(targetsitelist):
    for i in targetsite.readlines():
        i = i.strip()
        Url = "http://" + i
        DomainList = open("LiveSite.txt", "rb")
        DomainList.write(Url)
        DomainList.close()

def SiteBrowser():
    TargetSite = "LiveSite.txt"
    Tar = open(TargetSite, "rb")
    for Links in Tar.readlines():
        Links = Links.strip()
        UrlSites = "http://www." + Links
        browser = webdriver.Firefox()
        browser.get(UrlSites)
        browser.save_screenshot(Links+".png")
        browser.quit()

domaincreate(targetsite)
SiteBrowser()
I suspect that, whatever problem you have, a large part of it is that you are trying to write to a file that is open read-only. If you're running on Windows, you may also run into trouble because you are opening the files in binary mode but writing text (under a UNIX-based system there's no problem).
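A minimal corrected sketch (function and file names are taken from the question; it assumes Selenium and Firefox are installed) might look like this. The two calls at the bottom already run one after the other, so the real fixes are opening LiveSite.txt for writing, actually reading from the input file, and not double-prefixing the URLs:
#!/usr/bin/python
from selenium import webdriver

targetsite = "target_list.txt"

def domaincreate(targetsitelist):
    # read the input file and write one full URL per line to LiveSite.txt
    with open(targetsitelist) as infile, open("LiveSite.txt", "w") as outfile:
        for line in infile:
            line = line.strip()
            if line:
                outfile.write("http://" + line + "\n")

def SiteBrowser():
    # runs only after domaincreate() has returned, so LiveSite.txt is complete
    with open("LiveSite.txt") as tar:
        for link in tar:
            link = link.strip()
            browser = webdriver.Firefox()
            browser.get(link)
            # strip characters that are not valid in filenames before saving the screenshot
            browser.save_screenshot(link.replace("http://", "").replace("/", "_") + ".png")
            browser.quit()

domaincreate(targetsite)   # plain sequential calls: the second starts only when the first finishes
SiteBrowser()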

Selenium give file name when downloading

I am working with a Selenium script where I am trying to download an Excel file and give it a specific name. This is my code:
Is there any way that I can give the file being downloaded a specific name?
Code:
#!/usr/bin/python
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
profile = FirefoxProfile()
profile.set_preference("browser.helperApps.neverAsk.saveToDisk", "text/plain, application/vnd.ms-excel, text/csv, text/comma-separated-values, application/octet-stream")
profile.set_preference("browser.download.dir", "C:\\Downloads" )
browser = webdriver.Firefox(firefox_profile=profile)
browser.get('https://test.com/')
browser.find_element_by_partial_link_text("Excel").click() # Download file
Here is another simple solution, where you can wait until the download is completed and then get the downloaded file name from Chrome's downloads page.
Chrome:
# method to get the downloaded file name
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    # switch to new tab
    driver.switch_to.window(driver.window_handles[-1])
    # navigate to chrome downloads
    driver.get('chrome://downloads')
    # define the endTime
    endTime = time.time() + waitTime
    while True:
        try:
            # get downloaded percentage
            downloadPercentage = driver.execute_script(
                "return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value")
            # check if downloadPercentage is 100 (otherwise the script will keep waiting)
            if downloadPercentage == 100:
                # return the file name once the download is completed
                return driver.execute_script("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text")
        except:
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
Firefox:
def getDownLoadedFileName(waitTime):
    driver.execute_script("window.open()")
    WebDriverWait(driver, 10).until(EC.new_window_is_opened)
    driver.switch_to.window(driver.window_handles[-1])
    driver.get("about:downloads")
    endTime = time.time() + waitTime
    while True:
        try:
            fileName = driver.execute_script("return document.querySelector('#contentAreaDownloadsView .downloadMainArea .downloadContainer description:nth-of-type(1)').value")
            if fileName:
                return fileName
        except:
            pass
        time.sleep(1)
        if time.time() > endTime:
            break
Once you click on the download link/button, just call the above method.
# click on download link
browser.find_element_by_partial_link_text("Excel").click()
# get the downloaded file name
latestDownloadedFileName = getDownLoadedFileName(180) #waiting 3 minutes to complete the download
print(latestDownloadedFileName)
JAVA + Chrome:
Here is the method in java.
public String waitUntilDonwloadCompleted(WebDriver driver) throws InterruptedException {
    // Store the current window handle
    String mainWindow = driver.getWindowHandle();
    // open a new tab
    JavascriptExecutor js = (JavascriptExecutor) driver;
    js.executeScript("window.open()");
    // Switch to the new window opened
    for (String winHandle : driver.getWindowHandles()) {
        driver.switchTo().window(winHandle);
    }
    // navigate to chrome downloads
    driver.get("chrome://downloads");
    JavascriptExecutor js1 = (JavascriptExecutor) driver;
    // wait until the file is downloaded
    Long percentage = (long) 0;
    while (percentage != 100) {
        try {
            percentage = (Long) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('#progress').value");
            //System.out.println(percentage);
        } catch (Exception e) {
            // Nothing to do, just wait
        }
        Thread.sleep(1000);
    }
    // get the latest downloaded file name
    String fileName = (String) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').text");
    // get the latest downloaded file url
    String sourceURL = (String) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div#content #file-link').href");
    // file downloaded location
    String donwloadedAt = (String) js1.executeScript("return document.querySelector('downloads-manager').shadowRoot.querySelector('#downloadsList downloads-item').shadowRoot.querySelector('div.is-active.focus-row-active #file-icon-wrapper img').src");
    System.out.println("Download deatils");
    System.out.println("File Name :-" + fileName);
    System.out.println("Donwloaded path :- " + donwloadedAt);
    System.out.println("Downloaded from url :- " + sourceURL);
    // print the details
    System.out.println(fileName);
    System.out.println(sourceURL);
    // close the downloads tab
    driver.close();
    // switch back to main window
    driver.switchTo().window(mainWindow);
    return fileName;
}
This is how to call it from your Java code.
// download triggering step
downloadExe.click();
// now waituntil download finish and then get file name
System.out.println(waitUntilDonwloadCompleted(driver));
Output:
Download deatils
File Name :-RubyMine-2019.1.2 (7).exe
Donwloaded path :- chrome://fileicon/C%3A%5CUsers%5Csupputuri%5CDownloads%5CRubyMine-2019.1.2%20(7).exe?scale=1.25x
Downloaded from url :- https://download-cf.jetbrains.com/ruby/RubyMine-2019.1.2.exe
RubyMine-2019.1.2 (7).exe
You cannot specify the name of the downloaded file through Selenium. However, you can download the file, find the latest file in the download folder, and rename it as you want.
Note: methods borrowed from Google searches may have errors, but you get the idea.
import os
import shutil
filename = max([Initial_path + "\\" + f for f in os.listdir(Initial_path)],key=os.path.getctime)
shutil.move(filename,os.path.join(Initial_path,r"newfilename.ext"))
Hope this snippet is not too confusing. It took me a while to create, and it is really useful because there has not been a clear answer to this problem using just this library.
import os
import time

def tiny_file_rename(newname, folder_of_download):
    filename = max([f for f in os.listdir(folder_of_download)], key=lambda xa: os.path.getctime(os.path.join(folder_of_download, xa)))
    if '.part' in filename:
        time.sleep(1)
        os.rename(os.path.join(folder_of_download, filename), os.path.join(folder_of_download, newname))
    else:
        os.rename(os.path.join(folder_of_download, filename), os.path.join(folder_of_download, newname))
Hope this saves someone's day, cheers.
EDIT: Thanks to #Om Prakash for editing my code; it reminded me that I didn't explain the code thoroughly.
Using the max([]) function could lead to a race condition, leaving you with an empty or corrupted file (I know it from experience). You want to check whether the file is completely downloaded in the first place. This is because Selenium doesn't wait for the file download to complete, so when you check for the last created file, an incomplete file will show up in your generated list and the script will try to move that file. Even then, you are better off waiting a little bit for the file to be released by Firefox.
EDIT 2: More Code
I was asked if 1 second was enough time and mostly it is, but in case you need to wait more than that you could change the above code to this:
import os
import time

def tiny_file_rename(newname, folder_of_download, time_to_wait=60):
    time_counter = 0
    filename = max([f for f in os.listdir(folder_of_download)], key=lambda xa: os.path.getctime(os.path.join(folder_of_download, xa)))
    while '.part' in filename:
        time.sleep(1)
        time_counter += 1
        if time_counter > time_to_wait:
            raise Exception('Waited too long for file to download')
        filename = max([f for f in os.listdir(folder_of_download)], key=lambda xa: os.path.getctime(os.path.join(folder_of_download, xa)))
    os.rename(os.path.join(folder_of_download, filename), os.path.join(folder_of_download, newname))
There is something I would correct in #parishodak's answer: the filename there will only return the relative path (here, the name of the file), not the absolute path.
That is why #FreshRamen got the following error afterwards:
File "/usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/genericpath.py", line 72, in getctime
    return os.stat(filename).st_ctime
OSError: [Errno 2] No such file or directory: '.localized'
Here is the corrected code:
import os
import shutil

filepath = r'c:\downloads'
filename = max([filepath + "\\" + f for f in os.listdir(filepath)], key=os.path.getctime)
shutil.move(filename, newfilename)
I've come up with a different solution. Since you only care about the last downloaded file, why not download it into a dummy_dir? That way, it is going to be the only file in that directory. Once it's downloaded, you can move it to your destination_dir and change its name.
Here is an example that works with Firefox:
def rename_last_downloaded_file(dummy_dir, destination_dir, new_file_name):
    def get_last_downloaded_file_path(dummy_dir):
        """ Return the last modified -in this case last downloaded- file path.
        This function is going to loop as long as the directory is empty.
        """
        while not os.listdir(dummy_dir):
            time.sleep(1)
        return max([os.path.join(dummy_dir, f) for f in os.listdir(dummy_dir)], key=os.path.getctime)

    while '.part' in get_last_downloaded_file_path(dummy_dir):
        time.sleep(1)
    shutil.move(get_last_downloaded_file_path(dummy_dir), os.path.join(destination_dir, new_file_name))
You can fiddle with the sleep time and add a TimeoutException as well, as you see fit.
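For example, a rough sketch of such a timeout (the names here are illustrative, not part of the function above):
import os
import time

def wait_for_any_file(dummy_dir, max_wait=60):
    # poll the dummy directory; give up after max_wait seconds
    waited = 0
    while not os.listdir(dummy_dir):
        time.sleep(1)
        waited += 1
        if waited > max_wait:
            raise Exception('no file appeared in %s within %d seconds' % (dummy_dir, max_wait))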
Here is the code sample I used to download a pdf with a specific file name. First you need to configure the Chrome webdriver with the required options. Then, after clicking the button (to open the pdf popup window), call a function to wait for the download to finish and rename the downloaded file.
import os
import time
import shutil
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# function to wait for download to finish and then rename the latest downloaded file
def wait_for_download_and_rename(newFilename):
    # function to wait for all chrome downloads to finish
    def chrome_downloads(drv):
        if not "chrome://downloads" in drv.current_url:  # if 'chrome downloads' is not the current tab
            drv.execute_script("window.open('');")  # open a new tab
            drv.switch_to.window(driver.window_handles[1])  # switch to the new tab
            drv.get("chrome://downloads/")  # navigate to chrome downloads
        return drv.execute_script("""
            return document.querySelector('downloads-manager')
            .shadowRoot.querySelector('#downloadsList')
            .items.filter(e => e.state === 'COMPLETE')
            .map(e => e.filePath || e.file_path || e.fileUrl || e.file_url);
            """)
    # wait for all the downloads to be completed
    dld_file_paths = WebDriverWait(driver, 120, 1).until(chrome_downloads)  # returns list of downloaded file paths
    # Close the current tab (chrome downloads)
    if "chrome://downloads" in driver.current_url:
        driver.close()
    # Switch back to original tab
    driver.switch_to.window(driver.window_handles[0])
    # get latest downloaded file name and path
    dlFilename = dld_file_paths[0]  # latest downloaded file from the list
    # wait till downloaded file appears in download directory
    time_to_wait = 20  # adjust timeout as per your needs
    time_counter = 0
    while not os.path.isfile(dlFilename):
        time.sleep(1)
        time_counter += 1
        if time_counter > time_to_wait:
            break
    # rename the downloaded file
    shutil.move(dlFilename, os.path.join(download_dir, newFilename))
    return

# specify custom download directory
download_dir = r'c:\Downloads\pdf_reports'

# for configuring chrome pdf viewer for downloading pdf popup reports
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": download_dir,  # Set own Download path
    "download.prompt_for_download": False,  # Do not ask for download at runtime
    "download.directory_upgrade": True,  # Also needed to suppress download prompt
    "plugins.plugins_disabled": ["Chrome PDF Viewer"],  # Disable this plugin
    "plugins.always_open_pdf_externally": True,  # Enable this plugin
})

# get webdriver with options for configuring chrome pdf viewer
driver = webdriver.Chrome(options=chrome_options)

# open desired webpage
driver.get('https://mywebsite.com/mywebpage')

# click the button to open pdf popup
driver.find_element_by_id('someid').click()

# call the function to wait for download to finish and rename the downloaded file
wait_for_download_and_rename('My file.pdf')

# close the browser windows
driver.quit()
Set timeout (120) to the wait time as per your needs.
I am using the following function.
It checks for a file in the download location that you specify for Chrome/Selenium, and only if there is a file created at most 10 seconds ago (max_old_time) does it rename it. Otherwise, it waits a maximum of 60 seconds (max_waiting_time).
Not sure if it is the best way, but it worked for me.
import os, shutil, time
from datetime import datetime

def rename_last_file(download_folder, destination_folder, newfilename):
    # Will wait a maximum of max_waiting_time seconds for a new file in the folder.
    max_waiting_time = 60
    # Will rename only if the file was created less than max_old_time seconds ago.
    max_old_time = 10
    start_time = datetime.now().timestamp()
    while True:
        filelist = []
        last_file_time = 0
        for current_file in os.listdir(download_folder):
            filelist.append(current_file)
            current_file_fullpath = os.path.join(download_folder, current_file)
            current_file_time = os.path.getctime(current_file_fullpath)
            if os.path.isfile(current_file_fullpath):
                if last_file_time == 0:
                    last_file = current_file
                    last_file_time = os.path.getctime(os.path.join(download_folder, last_file))
                if current_file_time > last_file_time and os.path.isfile(current_file_fullpath):
                    last_file = current_file
        last_file_fullpath = os.path.join(download_folder, last_file)
        if start_time - last_file_time < max_old_time:
            shutil.move(last_file_fullpath, os.path.join(destination_folder, newfilename))
            print(last_file_fullpath)
            return(0)
        elif (datetime.now().timestamp() - start_time) > max_waiting_time:
            print("exit")
            return(1)
        else:
            print("waiting file...")
            time.sleep(5)
Using #dmb's trick. I've just made one correction: after the .part check, below time.sleep(1), we must fetch the filename again. Otherwise, the line below will try to rename a .part file, which no longer exists.
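For reference, a minimal sketch of that correction applied to the first tiny_file_rename snippet above (same assumptions as that snippet):
import os
import time

def tiny_file_rename(newname, folder_of_download):
    def newest_name():
        return max(os.listdir(folder_of_download),
                   key=lambda xa: os.path.getctime(os.path.join(folder_of_download, xa)))
    filename = newest_name()
    if '.part' in filename:
        time.sleep(1)
        filename = newest_name()  # re-read the name: the browser has renamed the finished .part file
    os.rename(os.path.join(folder_of_download, filename), os.path.join(folder_of_download, newname))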
Here is a browser-agnostic solution that waits for the download to finish then returns the file name.
from datetime import datetime, timedelta

def wait_for_download_and_get_file_name():
    print(f'Waiting for download to finish', end='')
    while True:
        # Get the name of the file with the latest creation time
        newest_file_name = max([os.path.join(DOWNLOAD_DIR, f) for f in os.listdir(DOWNLOAD_DIR)], key=os.path.getctime)
        # Get the creation time of the file
        file_creation_time = datetime.fromtimestamp(os.path.getctime(newest_file_name))
        five_seconds_ago = datetime.now() - timedelta(seconds=5)
        if file_creation_time < five_seconds_ago:
            # The file with the latest creation time is too old to be the file that we're waiting for
            print(f'.', end='')
            time.sleep(0.5)
        else:
            print(f'\nFinished downloading "{newest_file_name}"')
            break
    return newest_file_name
Caveat: this will not work if you have more than one thread or process downloading files to the same directory at the same time.
In my case I download and rename .csv files. I also use files that have '_' in the title as a reference, but you can change '_' for your specific usage.
Add this block after the download in your Selenium script.
import os

string = 'SOMETHING_OR_VARIABLE'
path = r'PATH_WHERE_FILE_ARE_BEING_DOWNLOAD'
files = [i for i in os.listdir(path) if os.path.isfile(os.path.join(path, i)) and '_' in i]
if files != []:
    print(files[0])
    os.rename(path + '\\' + files[0], path + '\\' + f'{string}.csv')
else:
    print('error')
You can download the file and name it at the same time using urlretrieve:
import urllib
url = browser.find_element_by_partial_link_text("Excel").get_attribute('href')
urllib.urlretrieve(url, "/choose/your/file_name.xlsx")
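Note that in Python 3 the same helper lives in urllib.request, so an equivalent sketch (same assumptions as above) would be:
from urllib.request import urlretrieve

url = browser.find_element_by_partial_link_text("Excel").get_attribute('href')
urlretrieve(url, "/choose/your/file_name.xlsx")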

most pythonic way to retrieve images from within a module (in HTML)

I am attempting to write a program that sends a url request to a site which then produces an animation of weather radar. I then scrape that page to get the image urls (they're stored in a Java module) and download them to a local folder. I do this iteratively over many radar stations and for two radar products. So far I have written the code to send the request, parse the html, and list the image urls. What I can't seem to do is rename and save the images locally. Beyond that, I want to make this as streamlined as possible- which is probably NOT what I've got so far. Any help 1) getting the images to download to a local folder and 2) pointing me to a more pythonic way of doing this would be great.
# import modules
import urllib2
import re
from bs4 import BeautifulSoup

##test variables##
stationName = "KBYX"
prod = ("bref1","vel1") # a tupel of both ref and vel
bkgr = "black"
duration = "1"
#home_dir = "/path/to/home/directory/folderForImages"

##program##
# This program needs to do the following:
# read the folder structure from home directory to get radar names
#left off here
list_of_folders = os.listdir(home_dir)
for each_folder in list_of_folders:
    if each_folder.startswith('k'):
        print each_folder
# here each folder that starts with a "k" represents a radar station, and within each folder are two other folders bref1 and vel1, the two products. I want the program to read the folders to decide which radar to retrieve the data for... so if I decide to add radars, all I have to do is add the folders to the directory tree.
# first request will be for prod[0] - base reflectivity
# second request will be for prod[1] - base velocity
# sample path:
# http://weather.rap.ucar.edu/radar/displayRad.php?icao=KMPX&prod=bref1&bkgr=black&duration=1
#base part of the path
base = "http://weather.rap.ucar.edu/radar/displayRad.php?"
#additional parameters
call = base+"icao="+stationName+"&prod="+prod[0]+"&bkgr="+bkgr+"&duration="+duration
#read in the webpage
urlContent = urllib2.urlopen(call).read()
webpage=urllib2.urlopen(call)
#parse the webpage with BeautifulSoup
soup = BeautifulSoup(urlContent)
#print (soup.prettify()) # if you want to take a look at the parsed structure
tag = soup.param.param.param.param.param.param.param #find the tag that holds all the filenames (which are nested in the PARAM tag, and
# located in the "value" parameter for PARAM name="filename")
files_in=str(tag['value'])
files = files_in.split(',') # they're in a single element, so split them by comma
directory = home_dir+"/"+stationName+"/"+prod[1]+"/"
counter = 0
for file in files: # now we should THEORETICALLY be able to iterate over them to download them... here I just print them
    print file
To save images locally, something like
import os

IMAGES_OUTDIR = '/path/to/image/output/directory'

for file_url in files:
    image_content = urllib2.urlopen(file_url).read()
    image_outfile = os.path.join(IMAGES_OUTDIR, os.path.basename(file_url))
    with open(image_outfile, 'wb') as wfh:
        wfh.write(image_content)
If you want to rename them, use the name you want instead of os.path.basename(file_url).
I use these three methods for downloading from the internet:
from os import path, mkdir
from urllib import urlretrieve

def checkPath(destPath):
    # Add final backslash if missing
    if destPath != None and len(destPath) and destPath[-1] != '/':
        destPath += '/'
    if destPath != '' and not path.exists(destPath):
        mkdir(destPath)
    return destPath

def saveResource(data, fileName, destPath=''):
    '''Saves data to file in binary write mode'''
    destPath = checkPath(destPath)
    with open(destPath + fileName, 'wb') as fOut:
        fOut.write(data)

def downloadResource(url, fileName=None, destPath=''):
    '''Saves the content at url in folder destPath as fileName'''
    # Default filename
    if fileName == None:
        fileName = path.basename(url)
    destPath = checkPath(destPath)
    try:
        urlretrieve(url, destPath + fileName)
    except Exception as inst:
        print 'Error retrieving', url
        print type(inst) # the exception instance
        print inst.args # arguments stored in .args
        print inst
There are a bunch of examples here for downloading images from various sites.
