Unable to open a file after downloading it - Python

I have a script which is used to do some scraping on Reddit.
import praw
import requests

def reddit_scrape():
    count = 0
    for submission in subreddit.new(limit = 100):
        if (is_known_id(submission_id = submission.id)):
            print('known')
            continue
        save_id(submission.id)
        save_to_dict(id = submission.id, txt = submission.title)
        img_data = requests.get(submission.url).content
        with open(submission.id, 'wb') as handler:
            handler.write(img_data)
        print(submission.url)
        count += 1
        if count >= 3: break
However, the file written through handler is saved without an extension, and I am not able to open it.
I have no idea what is causing this issue, as it was working perfectly a while ago.
Feel free to let me know if I am missing any info, as this is just part of the entire script.
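
One way to make the saved file openable is to give it an extension taken from the submission URL. A minimal sketch (not the original script; it assumes the Reddit submission URL path ends with the image's extension and falls back to .jpg otherwise):

import os
from urllib.parse import urlparse

import requests

def save_image(submission):
    # Derive the extension from the URL path, e.g. '.png' from '.../abc.png'.
    # Fall back to '.jpg' if the URL has no extension (an assumption, adjust as needed).
    ext = os.path.splitext(urlparse(submission.url).path)[1] or '.jpg'
    img_data = requests.get(submission.url).content
    with open(submission.id + ext, 'wb') as handler:
        handler.write(img_data)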

Related

Want to download images from website instead of json file

We need the images from the website <https://api.data.gov.sg/v1/transport/traffic-images>, but the script below downloads a JSON file instead. We want to download the images directly. I am a beginner. Thanks in advance.
from threading import Timer
import time
import requests

startlog = time.time()
image_url = "https://api.data.gov.sg/v1/transport/traffic-images"
tm = 0
while True:
    tm += 1
    r = requests.get(image_url) # create HTTP response object
    with open(str(tm)+"trafficFile.json", 'wb') as f:
        f.write(r.content)
    print(tm)
    time.sleep(20)
The small piece of code above saves the response to a file in your local directory (the folder where this script resides), but what it saves is the JSON payload rather than an image.
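
To save actual images rather than the JSON, the response has to be parsed and each image URL fetched separately. A rough sketch, assuming the payload lists the camera snapshots under items[0]['cameras'] with an 'image' URL per camera (check the real response structure before relying on these keys):

import time
import requests

image_url = "https://api.data.gov.sg/v1/transport/traffic-images"

while True:
    data = requests.get(image_url).json()
    # Assumed structure: items[0]['cameras'] is a list of camera records,
    # each carrying an 'image' field that points at a JPEG.
    for camera in data['items'][0]['cameras']:
        img = requests.get(camera['image']).content
        filename = str(camera.get('camera_id', 'camera')) + '.jpg'
        with open(filename, 'wb') as f:
            f.write(img)
    time.sleep(20)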

Python script not downloading files

I have the code below. It prints out the URL in the console. I'm having trouble figuring out how to get it to just download it instead of displaying it. I also want to be able to search for .mov file type. I'd rather have information on how to do this rather than it done for me. Any help is appreciated!
import urllib

def is_download_allowed():
    f = urllib.urlopen("http://10.1.1.27/config?action=get&paramid=eParamID_MediaState")
    response = f.read()
    if (response.find('"value":"1"') > -1):
        return True
    f = urllib.urlopen("http://10.1.1.27/config?action=set&paramid=eParamID_MediaState&value=1")

def download_clip():
    url = "http://10.1.1.27/media/SC1ATK26"
    print url

def is_not_download_allowed():
    f = urllib.urlopen("http://10.1.1.27/config?action=get&paramid=eParamID_MediaState")
    response = f.read()
    if (response.find('"value":"-1"') > 1):
        return True
    f = urllib.urlopen("http://10.1.1.27/config?action=set&paramid=eParamID_MediaState&value=1")

is_download_allowed()
download_clip()
is_not_download_allowed()
You say you don't want a full solution, so ...
Try urllib.urlretrieve.
As already commented, your download function is just printing a string.
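As an illustration of that hint, using the same Python 2 urllib module the question already imports, the download function could look roughly like this (the output filename is just a placeholder):

import urllib

def download_clip():
    url = "http://10.1.1.27/media/SC1ATK26"
    # urlretrieve fetches the resource at url and saves it to the given local file
    urllib.urlretrieve(url, "SC1ATK26.mov")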

How to run python functions sequentially

In the code below, "list.py" will read target_list.txt and create a domain list in the form "http://testsites.com".
Only when this process is completed do I know that target_list is finished and my other function should run. How do I sequence them properly?
#!/usr/bin/python
import Queue

targetsite = "target_list.txt"

def domaincreate(targetsitelist):
    for i in targetsite.readlines():
        i = i.strip()
        Url = "http://" + i
        DomainList = open("LiveSite.txt", "rb")
        DomainList.write(Url)
        DomainList.close()

def SiteBrowser():
    TargetSite = "LiveSite.txt"
    Tar = open(TargetSite, "rb")
    for Links in Tar.readlines():
        Links = Links.strip()
        UrlSites = "http://www." + Links
        browser = webdriver.Firefox()
        browser.get(UrlSites)
        browser.save_screenshot(Links+".png")
        browser.quit()

domaincreate(targetsite)
SiteBrowser()
I suspect that, whatever problem you have, a large part of it is that you are trying to write to a file that is open read-only. If you're running on Windows, you may later run into another problem: you open the files in binary mode but write text (under a UNIX-based system there's no problem).
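A sketch of that fix (assuming target_list.txt holds one domain per line): open the input file for reading and the output file for writing, and note that the two functions already run sequentially, because the second call does not start until the first one returns.

def domaincreate(targetsitelist):
    with open(targetsitelist) as infile, open("LiveSite.txt", "w") as outfile:
        for line in infile:
            outfile.write("http://" + line.strip() + "\n")

domaincreate("target_list.txt")  # runs to completion...
SiteBrowser()                    # ...before this call starts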

Can't read and write files fast enough for my script to recognize changes

I'm writing a script that will eventually be able to tweet from a Twitter account when my favourite YouTuber, Casey Neistat, uploads a new video. In order to do that, I wrote a program that (should) be able to compare an 'output.txt' file of all the links to his previous videos against a new list whenever the previous list of YouTube links does not include a recently uploaded video. I made two methods: one called 'mainLoop' runs over and over to see whether the previous list of all Casey Neistat's videos is the same as the string of new links retrieved by the method 'getNeistatNewVideo'. The problem I'm having is that once the program recognizes a new video, it goes to the method 'getNewURL', which should take the first link recorded in the 'output.txt' file. But when I tell it to print this new URL, it says there is nothing there. My hunch is that Python is not reading and writing the output.txt file fast enough, but I may be wrong.
My code is as follows:
import bs4
import requests
import re
import time
import tweepy

'''
This is the information required for Tweepy
CONSUMER_KEY =
CONSUMER_SECRET =
ACCESS_KEY =
ACCESS_SECRET =
auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)
api = tweepy.API(auth)
End of Tweepy Information
'''

root_url = 'https://www.youtube.com/'
index_url = root_url + 'user/caseyneistat/videos'

def getNeistatNewVideo():
    response = requests.get(index_url)
    soup = bs4.BeautifulSoup(response.text)
    return [a.attrs.get('href') for a in soup.select('div.yt-lockup-thumbnail a[href^=/watch]')]

def mainLoop():
    results = str("\n".join(getNeistatNewVideo()))
    past_results = open('output.txt').read()
    if results == past_results:
        print("No new videos at this time")
    else:
        print("There is a new video!")
        print('...')
        print('Writing to new text file')
        print('...')
        f = open("output.txt", "w")
        f.write(results)
        print('...')
        print('Done writing to new text file')
        print('...')
        getNewURL()

def getNewURL():
    url_search = open('output.txt').read()
    url_select = re.search('(.+)', url_search)
    print("New Url found: " + str(url_select))

while True:
    mainLoop()
    time.sleep(10)
    pass
You never close the files and that may be the problem. For instance, in mainLoop() you should have:
f = open("output.txt", "w")
f.write(results)
f.close()
or even better:
with open('output.txt', 'w') as output:
    output.write(results)
In general, it's a good idea to use the with statement everywhere you open a file (even in 'r' mode), as it automatically takes care of closing the file and also makes it clear which section of the code is working with the file at any given time.
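Applied to mainLoop(), that recommendation might look roughly like this (same logic as the original, just with both the read and the write wrapped in with blocks):

def mainLoop():
    results = "\n".join(getNeistatNewVideo())
    with open('output.txt') as f:           # read, then closed automatically
        past_results = f.read()
    if results == past_results:
        print("No new videos at this time")
    else:
        print("There is a new video!")
        with open('output.txt', 'w') as f:  # written and closed before getNewURL() reads it
            f.write(results)
        getNewURL()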

Multi-threading for downloading NCBI files in Python

So recently I have taken on the task of downloading a large collection of files from the NCBI database. However, I have run into times where I have to create multiple databases. The code here works to download all the viruses from the NCBI website. My question is: is there any way to speed up the process of downloading these files?
Currently the runtime of this program is more than 5 hours. I have looked into multi-threading and could never get it to work, because some of these files take more than 10 seconds to download and I do not know how to handle stalling (new to programming). Also, is there a way of handling urllib2.HTTPError: HTTP Error 502: Bad Gateway? I get this sometimes with certain combinations of retstart and retmax. It crashes the program and I have to restart the download from a different location by changing the 0 in the for statement.
import urllib2
from BeautifulSoup import BeautifulSoup
#This is the SearchQuery into NCBI. Spaces are replaced with +'s.
SearchQuery = 'viruses[orgn]+NOT+Retroviridae[orgn]'
#This is the Database that you are searching.
database = 'protein'
#This is the output file for the data
output = 'sample.fasta'
#This is the base url for NCBI eutils.
base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'
#Create the search string from the information above
esearch = 'esearch.fcgi?db='+database+'&term='+SearchQuery+'&usehistory=y'
#Create your esearch url
url = base + esearch
#Fetch your esearch using urllib2
print url
content = urllib2.urlopen(url)
#Open url in BeautifulSoup
doc = BeautifulSoup(content)
#Grab the amount of hits in the search
Count = int(doc.find('count').string)
#Grab the WebEnv or the history of this search from usehistory.
WebEnv = doc.find('webenv').string
#Grab the QueryKey
QueryKey = doc.find('querykey').string
#Set the max amount of files to fetch at a time. Default is 500 files.
retmax = 10000
#Create the fetch string
efetch = 'efetch.fcgi?db='+database+'&WebEnv='+WebEnv+'&query_key='+QueryKey
#Select the output format and file format of the files.
#For table visit: http://www.ncbi.nlm.nih.gov/books/NBK25499/table/chapter4.chapter4_table1
format = 'fasta'
type = 'text'
#Create the options string for efetch
options = '&rettype='+format+'&retmode='+type
#For statement 0 to Count counting by retmax. Use xrange over range
for i in xrange(0,Count,retmax):
    #Create the position string
    poision = '&retstart='+str(i)+'&retmax='+str(retmax)
    #Create the efetch URL
    url = base + efetch + poision + options
    print url
    #Grab the results
    response = urllib2.urlopen(url)
    #Write output to file
    with open(output, 'a') as file:
        for line in response.readlines():
            file.write(line)
    #Gives a sense of where you are
    print Count - i - retmax
To download files using multiple threads:
#!/usr/bin/env python
import shutil
from contextlib import closing
from multiprocessing.dummy import Pool # use threads
from urllib2 import urlopen

def generate_urls(some, params): #XXX pass whatever parameters you need
    for restart in range(*params):
        # ... generate url, filename
        yield url, filename

def download((url, filename)):
    try:
        with closing(urlopen(url)) as response, open(filename, 'wb') as file:
            shutil.copyfileobj(response, file)
    except Exception as e:
        return (url, filename), repr(e)
    else: # success
        return (url, filename), None

def main():
    pool = Pool(20) # at most 20 concurrent downloads
    urls = generate_urls(some, params)
    for (url, filename), error in pool.imap_unordered(download, urls):
        if error is not None:
            print("Can't download {url} to {filename}, "
                  "reason: {error}".format(**locals()))

if __name__ == "__main__":
    main()
You should use multithreading; it's the right approach for download tasks.
"These files take more than 10 seconds to download and I do not know how to handle stalling" -
I don't think this will be a problem, because Python's multithreading handles exactly this kind of I/O-bound work: while one thread is waiting for a download to complete, the CPU lets other threads do their work.
Anyway, you'd better at least try it and see what happens.
There are two other ways to tackle your task: 1. use processes instead of threads, with the multiprocessing module; 2. go event-based, with the gevent module.
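As a small illustration of the first option, swapping processes in for threads is mostly a one-line change relative to the pool sketch above, because multiprocessing.Pool offers the same map-style interface (download_one is a hypothetical per-URL download function):

from multiprocessing import Pool  # worker processes instead of threads

def download_one(url):
    # placeholder for the real download-and-save logic for a single URL
    pass

if __name__ == "__main__":
    pool = Pool(4)  # number of worker processes
    pool.map(download_one, ["http://example.com/a", "http://example.com/b"])
    pool.close()
    pool.join()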
The 502 error is not your script's fault. The following pattern can be used to retry:
try_count = 3
while try_count > 0:
    try:
        download_task()
        break  # success, stop retrying
    except urllib2.HTTPError:
        clean_environment_for_retry()
        try_count -= 1
In the except clause, you can refine the handling to do particular things according to the concrete HTTP status code.
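For example, urllib2.HTTPError exposes the status code as e.code, so the retry loop above could be narrowed to retry only on 502 and re-raise anything else (a sketch that reuses the same hypothetical download_task and clean_environment_for_retry names):

try_count = 3
while try_count > 0:
    try:
        download_task()
        break  # success, stop retrying
    except urllib2.HTTPError as e:
        if e.code == 502:  # Bad Gateway: worth another attempt
            clean_environment_for_retry()
            try_count -= 1
        else:              # any other HTTP error: give up
            raise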
