In the code below, "domaincreate" will read target_list.txt and create a domain list as "LiveSite.txt".
Only once this process has completed do I know that target_list is finished and my other function can run. How do I sequence them properly?
import Queue
targetsite = "target_list.txt"
def domaincreate(targetsitelist):
    for i in targetsite.readlines():
        i = i.strip()
        Url = "http://" + i
        DomainList = open("LiveSite.txt", "rb")
def SiteBrowser():
    TargetSite = "LiveSite.txt"
    Tar = open(TargetSite, "rb")
    for Links in Tar.readlines():
        Links = Links.strip()
        UrlSites = "http://www." + Links
        browser = webdriver.Firefox()
I suspect that a large part of whatever problem you have is that you are trying to write to a file that is opened read-only ("rb"). If you're running on Windows, you may later hit a second problem: the file is opened in binary mode, but you are writing a text file (under a UNIX-based system, this makes no difference).
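One way to sequence the two steps is simply to call them in order and let the first function close its file (via a with block) before the second one opens it. The following is a minimal sketch, assuming the goal is to write one URL per line to LiveSite.txt. The function names and the `process` callback (a stand-in for the Selenium work) are hypothetical and not from the question:

```python
def create_domain_list(source_path, live_path):
    """Read hostnames from source_path and write one URL per line to live_path."""
    # "w" opens the output for writing; "rb" would be read-only.
    with open(source_path, "r") as src, open(live_path, "w") as out:
        for line in src:
            host = line.strip()
            if host:
                out.write("http://" + host + "\n")
    # Both files are closed here, so the list is complete on disk.


def browse_sites(live_path, process):
    """Call process(url) for every URL in live_path; process is a placeholder."""
    with open(live_path, "r") as live:
        return [process(url.strip()) for url in live if url.strip()]


def run(source_path, live_path, process):
    create_domain_list(source_path, live_path)  # step 1 finishes first
    return browse_sites(live_path, process)     # step 2 only starts afterwards
```

Because `create_domain_list` only returns after its with block has closed the file, nothing extra is needed to signal that the list is finished.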
I am attempting to make a program update itself to the newest version that I have made. E.g. I added a new functionality to it. It would be useful for me to be able to upload the updated file to a central location like my Raspberry Pi and have the program update itself across all of my computers without updating each one individually.
I have made the code below, but it does not work. It can recognize when the file is up to date, but running the newly downloaded program fails: it successfully downloads the update and deletes itself, but the new program is not run, and no error messages are shown.
import time
import requests
import os
import hashlib
cwd = os.getcwd()
URL = r"http://[rasberry pi's ip]/update%20files/dev/hash.txt"
hash_path = os.path.join(cwd,"remote hash.txt")
with open (hash_path, "wb") as f:
    f.write(requests.get(URL).content)  # assumed: save the downloaded hash file
with open(hash_path,"r") as hash_file:
    remotehash = (hash_file.readline()).strip()
hasher = hashlib.sha256()
with open(__file__, 'rb') as self_file:
    selfunhashed = self_file.read()
hasher.update(selfunhashed)  # assumed: hash this script's own bytes
selfhash = hasher.hexdigest()
if (selfhash == remotehash):
    print("program is up to date")
else:
    update_path = os.path.join(cwd,"temp name")
    URL = r"http://[rasberry pi's ip]/update%20files/dev/"
    with open (update_path, "wb") as f:
        f.write(requests.get(URL).content)  # assumed: save the downloaded update
    with open(update_path,"r") as f:
        name = f.readline().strip()
    name = name[1:] #use the 1st line as "" not "# name"
    update_path = os.path.join(cwd,name)
    os.rename(os.path.join(cwd,"temp name"),update_path)
    os.system("python \""+update_path+"\"")
    print("removing self file now")
    os.remove(__file__)  # assumed: the script deletes itself here
It uses a separate TXT file containing the hash of the program, stored in the same folder, to check the remote file's hash without downloading the actual file to hash it locally.
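The hash check described above can be isolated into two small helpers. This is only a sketch of the comparison step, not a fix for the launch problem. The names `file_sha256` and `is_up_to_date` are made up for illustration, and it assumes the remote TXT file holds a lowercase-or-uppercase hex SHA-256 digest on its first line:

```python
import hashlib


def file_sha256(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    hasher = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps reading until f.read returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def is_up_to_date(local_path, remote_hash):
    """Compare the local file's digest with the published one."""
    return file_sha256(local_path) == remote_hash.strip().lower()
```

In the script above, `local_path` would be `__file__` and `remote_hash` the first line of the downloaded hash file.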
I don't know why this started happening recently. I have a function that opens a new text file, writes a url to it, then closes it, but the file is not created immediately after f.close() is executed. The problem is that a later function, open_url(), needs to read a url from that text file, but since nothing is there, my program errors out.
Ironically, after my program errors out and I stop it, the url.txt file is made haha. Anyone know why this is happening with the python .write() action? Is there another way to create a text file and write a line of text to that text file faster?
def write_url():
    if not path.exists('url.txt'):
        url = UrlObj().url
        with open('url.txt', 'w') as f:
            f.write(url)
def open_url():
    x = open('url.txt', 'r')
    y = x.read()
    return y
def main():
    scraper = Job()
    url = scraper.open_url()
    results = scraper.load_craigslist_url(url)
    dictionary_of_listings = scraper.organizeResults(results)
if __name__ == '__main__':
    scheduler = BlockingScheduler()
    scheduler.add_job(main, 'interval', hours=1)
    scheduler.start()
There is another class called url that prompts the user to add attributes to a bare url for selenium to use. UrlObj().url gives you the url, which is then written to the new text file. If the url.txt file already exists, then pass and go to open_url() and get the url from the url.txt file to pass to the url variable, which is used to start the scraping.
Just found a work around. If the file does not exist then return the url to be fed directly to load_craigslist_url. If the text file exists then just read from the text file.
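A minimal sketch of that workaround, assuming a single helper is acceptable. The name `get_url` and the `make_url` callback (standing in for `UrlObj().url`) are hypothetical:

```python
import os


def get_url(path, make_url):
    """Return the saved URL, creating the file from make_url() on first use."""
    if os.path.exists(path):
        with open(path, "r") as f:
            return f.read().strip()
    url = make_url()  # only prompt the user when no file exists yet
    with open(path, "w") as f:
        f.write(url)
    return url  # hand the URL back directly instead of re-reading the file
```

With this shape, the scraper never depends on the file being readable in the same run that created it.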
I've been trying to get the hang of multithreading in Python. However, whenever I attempt to make it do something that might be considered useful, I run into issues.
In this case, I have 300 PDF files. For simplicity, we'll assume that each PDF only has a unique number on it (say 1 to 300). I'm trying to make Python open the file, grab the text from it, and then use that text to rename the file accordingly.
The non-multithreaded version I make works amazing. But it's a bit slow and I thought I'd see if I could speed it up a bit. However, this version finds the very first file, renames it correctly, and then throws an error saying:
FileNotFoundError: [Errno 2] No such file or directory: './pdfPages/1006941.pdf'
Which is basically telling me that it can't find a file by that name. The reason it can't is that it has already renamed it. In my head, that tells me that I've probably messed something up with this loop and/or multithreading in general.
Any help would be appreciated.
import PyPDF2
import os
from os import listdir
from os.path import isfile, join
from PyPDF2 import PdfFileWriter, PdfFileReader
from multiprocessing.dummy import Pool as ThreadPool
# Global
i = 0
def readPDF(allFiles):
    global i
    while i < l:
        pdf_file = open(path+allFiles, 'rb')
        read_pdf = PyPDF2.PdfFileReader(pdf_file)
        number_of_pages = read_pdf.getNumPages()
        page = read_pdf.getPage(0)
        page_content = page.extractText()
        Text = str(page_content.encode('utf-8')).strip("b").strip("'")
        pre = "77"
path = "./pdfPages/"
included_extensions = ['pdf','PDF']
allFiles = [f for f in listdir(path) if any(f.endswith(ext) for ext in included_extensions)] # Get all files in current directory
l = len(allFiles)
pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
Yes, you have, in fact, messed up with the loop, as you say. The loop should not be there at all. Iteration is handled implicitly by pool.map, which ensures that each function call receives a unique file name from your list to work with. You should not do any other looping.
I have updated your code below, by removing the loop and some other changes (minor, but still improvements, I think):
# Removed a number of imports
import PyPDF2
import os
from multiprocessing.dummy import Pool as ThreadPool
# Removed not needed global variable
def readPDF(allFiles):
    # The while loop is not needed, as pool.map will distribute the different
    # files to different threads anyway
    pdf_file = open(path+allFiles, 'rb')
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.getPage(0)
    page_content = page.extractText()
    Text = str(page_content.encode('utf-8')).strip("b").strip("'")
    pre = "77"
path = "./pdfPages/"
included_extensions = ('pdf','PDF') # Tuple instead of list
# Tuple allows for simpler "f.endswith"
allFiles = [f for f in os.listdir(path) if f.endswith(included_extensions)]
pool = ThreadPool(4)
doThings = pool.map(readPDF, allFiles)
# doThings will be a list of "None"s since readPDF returns nothing
Thus, the global variable and the counter are not needed, since all of that is handled implicitly. But, even with these changes, it is not at all certain that this will speed up your execution much. Most likely, the bulk of your program execution is waiting for the disk to load. In that case, it is possible that even if you have multiple threads, they will still have to wait for the main resource, i.e., the hard drive. But to know for certain, you have to test.
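To see the "no loop in the worker" idea in isolation, here is a small sketch that leaves out PyPDF2 and the disk entirely. The worker `new_name` and the wrapper `rename_all` are hypothetical names, and the "77" prefix is borrowed from the code above purely as an example:

```python
from multiprocessing.dummy import Pool as ThreadPool


def new_name(filename, prefix="77"):
    """Hypothetical per-file worker: build the new name for ONE file."""
    stem, ext = filename.rsplit(".", 1)
    return prefix + stem + "." + ext.lower()


def rename_all(filenames, workers=4):
    """Let the pool hand each filename to new_name exactly once."""
    pool = ThreadPool(workers)
    try:
        # map() does the iteration; results come back in input order.
        return pool.map(new_name, filenames)
    finally:
        pool.close()
        pool.join()
```

The worker handles a single item and contains no loop or shared counter, so no two threads ever work on the same file.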
**I solved this below**: I think it may be helpful to others in the future, so I'm keeping my question up vs. taking it down. It's a python vs. other language nested file import issue. However if anyone understands the intricacies of why this is so in python an explanatory answer would greatly be appreciated.
I had my code running fine with a file directory setup like this:
sniffer //folder
I switched it to:
In theory if I change the imports to find the folders it should still run the same way...
I was under the impression that importing a python file could be done even if it was nested. For example
import Sniffer // in snifferLaunch should go through each file and try to find a file.
I however found this to be false, did I misunderstand this? So I tried looking at an example which imports files like this:
import flashy.sniffer.Sniffer as Sniffer
This does import a file I believe. When I run it it traces out an error on launch however:
Traceback (most recent call last):
File "", line 19, in <module>
import flashy.sniffer.Sniffer
File "/Users/tai/Desktop/FlashY/flashy/sniffer/", line 110, in <module>
File "/Users/tai/Desktop/FlashY/flashy/sniffer/", line 107, in forInFile
File "/Users/tai/Desktop/FlashY/flashy/sniffer/", line 98, in runFlashY
File "/Users/tai/Desktop/FlashY/flashy/sniffer/", line 89, in db
AttributeError: 'module' object has no attribute 'getDecompiledFiles'
This would normally cause me to go look for a getDecompiledFiles function. The problem is that nowhere in the code is there a getDecompiledFiles. There is a get_Decompiled_Files function.
My code looks something like this (non essential parts removed). Do you see my bug? I searched the entire project and could not find a getDecompiledFiles function anywhere. I don't know why it is expecting to have an attribute of this...
import flashy.sniffer.Sniffer as Sniffer
import flashy.sniffer.database as database
import flashy.sniffer.cleaner as cleaner
def open_websites(line):
#opens a list of websites from local file "urlIn.txt" and runs the Sniffer on them.
#It retrieves the swfs from each url and storing them in the local out/"the modified url" /"hashed swf.swf" and the file contains the decompiled swf
print( "opening websites")
newSwfFiles = [];
# reads in all of the lines in urlIn.txt
#for line in urlsToRead:
if line[0] !="#":
newLine = cleaner.remove_front(line);
# note the line[:9] is done to avoid the http// which creates an additional file to go into. The remaining part of the url is still unique.
outFileDirectory = decSwfsFolder + "/" + newLine
newSwfFiles = Sniffer.open_url(line, []);
print " Sniffer.openURL failed"
# for all of the files there it runs jpex on them. (in the future this will delete the file after jpex runs so we don't run jpex more than necessary)
for location in newSwfFiles:
cleaner.check_or_create_dir(outFileDirectory + "/" + location)
#creates the command for jpex flash decompiler, the command + file to save into + location of the swf to decompile
newCommand = javaCommand + "/" + newLine + "/" + location +"/ " + swfLoc +"/"+ location
print ("+++this is the command: " + newCommand+"\n")
# move the swf into a new swf file for db storage
oldLocation = swfFolder + location;
newLocation = decSwfsFolder + "/" + newLine + "/" + location + "/" + "theSwf"+ "/"
cleaner.check_or_create_dir(newLocation )
# if the file already exists at that location do not move it simply delete it (the) duplicate
if(os.path.exists(newLocation +"/"+ location)):
shutil.move(swfFolder + location, newLocation)
if cleanup:
# newSwfFiles has the directory file location of each new added file: "directory/fileHash.swf"
def db():
def run_flashY(line):
#Run FlashY a program that decompiles all of the swfs found at urls defined in urlIn.txt.
#Each decompiled file will be stored in the PaperG Amazon S3 bucket: decompiled_swfs.
#run the program for each line
#open all of the websites in the url file urlIn.txt
#store the decompiled swfs in the database
#remove all files from local storage
#kill all instances of firefox
def for_in_file():
#run sniffer for each line in the file
#for each url, run then kill firefox to prevent firefox buildup
for line in urlsToRead:
#Main Functionality
if __name__ == '__main__':
#initialize and run the program on launch
The Sniffer File:
import urllib2
from urllib2 import Request, urlopen, URLError, HTTPError
import shutil
import sys
import re
import os
import hashlib
import time
import datetime
from selenium import webdriver
import glob
import thread
import httplib
from collections import defaultdict
import cleaner
curPath = os.path.dirname(os.path.realpath(__file__))
#firebug gets all network data
fireBugPath = curPath +'/firebug-1.12.8b1.xpi';
#netExport exports firebug's http archive (network req/res) in the form of a har file
netExportPath = curPath +'/netExport.xpi';
harLoc = curPath +"/har/";
swfLoc = curPath +"/swfs";
#remove har file(s) after reading them out to gather swf files
profile = webdriver.firefox.firefox_profile.FirefoxProfile();
profile.add_extension( fireBugPath);
hashLib = hashlib.md5()
#firefox preferences
profile.set_preference("app.update.enabled", False)
profile.native_events_enabled = True
profile.set_preference("webdriver.log.file", curPath +"webFile.txt")
profile.set_preference("extensions.firebug.DBG_STARTER", True);
profile.set_preference("extensions.firebug.currentVersion", "1.12.8");
profile.set_preference("extensions.firebug.addonBarOpened", True);
profile.set_preference('extensions.firebug.consoles.enableSite', True)
profile.set_preference("extensions.firebug.console.enableSites", True);
profile.set_preference("extensions.firebug.script.enableSites", True);
profile.set_preference("", True);
profile.set_preference("extensions.firebug.previousPlacement", 1);
profile.set_preference("extensions.firebug.allPagesActivation", "on");
profile.set_preference("extensions.firebug.onByDefault", True);
profile.set_preference("extensions.firebug.defaultPanelName", "net");
#set net export preferences
profile.set_preference("extensions.firebug.netexport.alwaysEnableAutoExport", True);
profile.set_preference("extensions.firebug.netexport.autoExportToFile", True);
profile.set_preference("extensions.firebug.netexport.saveFiles", True);
profile.set_preference("extensions.firebug.netexport.autoExportToServer", False);
profile.set_preference("extensions.firebug.netexport.Automation", True);
profile.set_preference("extensions.firebug.netexport.showPreview", False);
profile.set_preference("extensions.firebug.netexport.pageLoadedTimeout", 15000);
profile.set_preference("extensions.firebug.netexport.timeout", 10000);
browser = webdriver.Firefox(firefox_profile=profile);
def open_url(url,s):
#open each url, find all of the har files with them and get those files.
theURL = url;
#browser = webdriver.Chrome();
browser.get(url); #load the url in firefox
time.sleep(3); #wait for the page to load
browser.execute_script("window.scrollTo(0, document.body.scrollHeight/5);")
time.sleep(1); #wait for the page to load
browser.execute_script("window.scrollTo(0, document.body.scrollHeight/4);")
time.sleep(1); #wait for the page to load
browser.execute_script("window.scrollTo(0, document.body.scrollHeight/3);")
time.sleep(1); #wait for the page to load
browser.execute_script("window.scrollTo(0, document.body.scrollHeight/2);")
time.sleep(1); #wait for the page to load
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(20); #wait for the page to load
# print(browser.page_source);
#close the browser and get all the swfs from the created har file.
#uses the a & b arrays to find the swf files from generated har files
#clean out the slashes
#get all files
#ensure that some files were gained
assert a != []
assert b != []
assert newSwfFiles != []
#if the files (har, swf, out) should be cleaned out do so. This can be toggled for dubugging
return newSwfFiles;
def remove_non_url(t):
#remove matched urls that are not actually urls
for b in t:
if(b.lower()[:4] !="http" and b.lower()[:4] != "www." ):
if(b[:2] == "//" and b.__len__() >10):
while((b.lower()[:4] !="http" or b.lower()[:4] !="www." or b.lower()[:1] !="//") and b.__len__() >10):
if( b.__len__() >10):
if(b[:1] == "//" ):
if not b in a:
if not b in a:
if not b in a:
return a;
def get_swfs_from_har():
#validate that the files in the har are actual swf files
files = [f for f in os.listdir(harLoc) if re.match((theURL[7:]+ '.*.har'), f)]
for n in files:
with open (harLoc + n , "r") as theF:
textt =;
swfObjects= re.findall('\{[^\{]*(?:http:\/\/|https:\/\/|www\.|\/\/)[^}]*\.swf[^}]+', textt.lower())
#swfObjects = "".join(str(i) for i in swfObjects)
for obj in swfObjects:
links = re.findall('(?:http:\/\/|https:\/\/|www\.|\/\/)[^"]+', obj)
for url in links:
ending = url[url.__len__()-6:];
if ".swf" in ending:
elif "." not in ending:
for c in l:
if not c in a and c.__len__() >20:
##adds the 1st link after the swf
def clean_f_slashes():
#remove unrelated characters from swfs
for x in a:
if(',' in x or ';' in x or '\\' in x):
for d in x:
if(d != '\\' and d != ',' and d != ';'):
if "http" not in newS.lower():
if "www" in newS:
newS= "http://" + newS;
newS = "http://www."+newS
if(newS.__len__() >15):
def get_all_files():
#get all of the files from the array of valid swfs
for openUrl in a:
place = a.index(openUrl);
req = Request(openUrl)
response = urlopen(req)
fData = urllib2.urlopen(openUrl)
iText =
#get the hex hash of the file
hashV =hashLib.hexdigest()+".swf";
outUrl= get_redirected_url(b[place]);
#check if file already exists, if it does do not add a duplicate
theFile = [f for f in os.listdir(swfLoc) if re.match((hashV), f)]
if hashV not in theFile:
lFile = open(outUrl+"," +hashV, "w")
#except and then ignore are invalid urls.
#Remove all files less than 8kb, anything less than this size is unlikely to be an advertisement. Most flash ads seen so far are 25kb or larger
sFiles = [f for f in os.listdir(swfLoc)]
for filenames in sFiles:
sizeF = os.path.getsize(filenames);
#if the file is smaller remove it
def x_str(s):
    #check if a unicode expression exists and convert it to a string
    if s is None:
        return ''
    return str(s)
def get_redirected_url(s):
#get the url that another url will redirect to
if s is None:
return "";
if ".macromedia" in s:
return ""
aUrl= re.findall("[^/]+",theredirectedurl)[0].encode('ascii','ignore')
return aUrl;
Interesting... so I actually realized I was going about it wrong.
I still don't know why it was expecting a function that didn't exist but I do have a guess.
I had pulled the __init__.py file to use as the snifferLaunch file. This was due to my original misunderstanding of __init__.py and assuming it was similar to a main in other languages.
I believe the __init__.pyc file was holding an old function that had been outdated. Essentially I believe there was a file that should never have been run, it was outdated and somehow getting called. It was the only file that existed that had that function in it, I overlooked it because I thought it shouldn't be called.
The solution is the following, and the bug was caused by my misuse of __init__.
I changed my import statements:
from flashy.sniffer import Sniffer
import flashy.sniffer.database as database
import flashy.sniffer.cleaner as cleaner
I created new blank __init__.py and __init__.pyc files in flashy/sniffer/.
This prevented the false expectation for getDecompiledFiles, and also allowed the code to be run. I was getting a "cannot find this file" error because it wasn't correctly being identified as a module. Additional information on this would be appreciated if anyone can explain what was going on there. I thought you could run a python file without an init statement however when nested in other folders it appears that it must be opened as a python module.
My file structure looks like this now:
Main //with changed import statements
--sniffer //blank
---__init__.pyc // blank
It appears to be a Python vs. other languages issue. Has anyone else experienced this?
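On why the blank __init__.py mattered: in Python 2, which this code appears to target (it uses urllib2 and print statements), a directory is only importable as a package if it contains an __init__.py, and the file may be empty. A plain `import Sniffer` only looks in the directories on sys.path and does not search nested folders. Python 3.3 and later relax the first rule with namespace packages, but adding the file is still harmless there. The sketch below builds a throwaway package tree to show the dotted import working. The names `demo_flashy` and `build_demo_package` are invented for the demonstration:

```python
import importlib
import os
import sys
import tempfile


def build_demo_package(root):
    """Create root/demo_flashy/sniffer/Sniffer.py with blank __init__.py files."""
    pkg_dir = os.path.join(root, "demo_flashy", "sniffer")
    os.makedirs(pkg_dir)
    for d in (os.path.join(root, "demo_flashy"), pkg_dir):
        # A blank __init__.py is what marks a directory as a package.
        open(os.path.join(d, "__init__.py"), "w").close()
    with open(os.path.join(pkg_dir, "Sniffer.py"), "w") as f:
        f.write("def open_url(url, s):\n    return [url]\n")


root = tempfile.mkdtemp()
build_demo_package(root)
sys.path.insert(0, root)  # make the top-level package findable
Sniffer = importlib.import_module("demo_flashy.sniffer.Sniffer")
result = Sniffer.open_url("http://example.com", [])
```

The `importlib.import_module` call is equivalent to `from demo_flashy.sniffer import Sniffer`, which is the import style used in the fix above.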
So recently I have taken on the task of downloading a large collection of files from the NCBI database. However, I have run into cases where I have to create multiple databases. The code here works and downloads all the viruses from the NCBI website. My question: is there any way to speed up the process of downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multithreading but could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (I'm new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I get this sometimes with certain combinations of retstart and retmax. It crashes the program, and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup
#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = ''
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit:
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0,Count,retmax):
    #Create the position string
    poision = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + poision + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen
def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename
def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None
def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))
if __name__ == "__main__":
    main()
You should use multithreading, it's the right way for downloading tasks.
"these files take more than 10seconds to download and I do not know how to handle stalling",
I don't think this would be a problem because Python's multithreading will handle this, or I'd rather say multithreading is just for this kind of I/O-bound work. When a thread is waiting for download to complete, CPU will let other threads do their work.
Anyway, you'd better at least try it and see what happens.
There are two other ways to approach your task: 1. Use processes instead of threads; multiprocessing is the module you should use. 2. Use an event-based approach; gevent is the right module.
The 502 error is not your script's fault. The following simple pattern can be used to retry:
try_count = 3
while try_count > 0:
    try:
        response = urllib2.urlopen(url)
        break  # success, stop retrying
    except urllib2.HTTPError:
        try_count -= 1
In the except clause, you can refine the handling to do particular things depending on the specific HTTP status code.
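As a sketch of what refining by status code might look like, the helper below retries only on gateway-type errors and re-raises everything else. It is written for Python 3, where urllib2.HTTPError lives at urllib.error.HTTPError. The name `fetch_with_retry`, the `fetch` callback and the default list of retryable codes are assumptions for illustration, not part of the answer above:

```python
import time
import urllib.error


def fetch_with_retry(fetch, url, attempts=3, retry_codes=(502, 503, 504), delay=0.0):
    """Call fetch(url), retrying only on the listed HTTP status codes."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except urllib.error.HTTPError as e:
            # Give up on the last attempt or on codes a retry will not fix.
            if e.code not in retry_codes or attempt == attempts:
                raise
            time.sleep(delay * attempt)  # simple linear back-off
```

In the download loop, `fetch` could be `urllib.request.urlopen` (or `urllib2.urlopen` on Python 2, with the exception class adjusted accordingly). Passing a `timeout` argument to `urlopen` is one way to keep a stalled download from blocking a thread indefinitely.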