Taking screenshots from multiple URLs in a file - Python

I'm trying to take screenshots of pages whose URLs are in a .txt file. When I tried just a single URL such as "http://testing.com" it works, but when I assign it to a variable instead of using the string directly it doesn't. Here's the code:
def capture(self, url, output_file):
    self.load(QUrl(url))
    self.wait_load()

file_list = open("LiveSite.txt")
for site in file_list.readlines():
    time.sleep(5)
    s.capture(site, site + ".png")

The site variable will contain the trailing newline character for every line in the file, so the URL passed to QUrl is invalid. Strip it:
site = site.strip()
before calling s.capture:
def capture(self, url, output_file):
    self.load(QUrl(url))
    self.wait_load()

file_list = open("LiveSite.txt")
for site in file_list.readlines():
    time.sleep(5)
    site = site.strip()
    s.capture(site, site + ".png")
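As a side note, here is a minimal variant (assuming s and time are the same objects used in the question) that opens the file with a context manager so it is closed automatically and skips blank lines:

with open("LiveSite.txt") as file_list:
    for site in file_list:
        site = site.strip()  # drop the trailing newline
        if site:             # skip blank lines
            time.sleep(5)
            s.capture(site, site + ".png")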


Change Twitter banner from URL

How would I go about changing the Twitter banner using an image from a URL with the tweepy library: https://github.com/tweepy/tweepy/blob/v2.3.0/tweepy/api.py#L392
So far I have this, and it returns:
def banner(self):
    url = 'https://blog.snappa.com/wp-content/uploads/2019/01/Twitter-Header-Size.png'
    file = requests.get(url)
    self.api.update_profile_banner(filename=file.content)
ValueError: stat: embedded null character in path
It seems like filename requires an image that has been downloaded to disk. Is there any way to process this without downloading the image and then removing it?
Looking at the library's code, you can do what you want.
def update_profile_banner(self, filename, *args, **kargs):
    f = kargs.pop('file', None)
So what you need to do is supply the filename and the file kwarg:
filename = url.split('/')[-1]
self.api.update_profile_banner(filename, file=file.content)
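Alternatively, you can write the image to a named temporary file and pass its path as the filename: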
import tempfile

import requests

def banner(self):
    url = 'file_url'
    file = requests.get(url)
    temp = tempfile.NamedTemporaryFile(suffix=".png")
    try:
        temp.write(file.content)
        temp.flush()  # make sure the bytes are on disk before tweepy opens the path
        self.api.update_profile_banner(filename=temp.name)
    finally:
        temp.close()

How can I make this code (Google Street View) work? I'm getting 403 Forbidden messages every time, even though I set up a user agent

So I'm trying out a program which downloads Google Street View images. The addresses are located in a .txt file. Every time I try to run the code, HTTP Error 403: Forbidden comes up. In my actual code I use my Google Developer API key, of course, and the right file paths.
I've tried to set up a user agent, but it just doesn't work. Can anyone help me: what should I do, and how do I implement it in this code?
# import os and requests modules
# os for file path creation
# requests for accessing web content
import os

import requests

# this is the first part of the streetview url, up to the address; this url will return a 600x600px image
pre = "https://maps.googleapis.com/maps/api/streetview?size=600x600&location="

# the text variable below holds the path to a text file containing one address per line
# the addresses in this text file complete the URL needed to return a streetview image and provide the filename of each streetview image
text = r"C:\Users\............."

# this is the third part of the url, needed after the address
# this is my API key, please replace the one below with your own (google 'google streetview api key'), thanks!
suf = "&key=abcdertghjhrtrwhgrh"

# this is the directory that will store the streetview images
# this directory will be created if not present
dir = r"C:\Users\.........."

headers = {'User-Agent': 'Chrome/75.0.3770.100 Safari/537.36',
           'From': 'asdasd#asd.com'}

# checks if the dir variable (output path) above exists and creates it if it does not
if not os.path.exists(dir):
    os.makedirs(dir)

# opens the address list text file (from the 'text' variable defined above) in read mode ("r")
with open(text, "r") as text_file:
    # the 'lines' list holds each address line of the source 'text' file
    lines = [line.rstrip('\n') for line in text_file]
print("THE CONTENTS OF THE TEXT FILE:\n" + str(lines))

# start a loop through the 'lines' list
for line in lines:
    # string clean-up to get rid of commas in the url and filename
    ln = line.replace(",", "")
    print("CLEANED UP ADDRESS LINE:\n" + ln)
    # the full, valid url that will return a google streetview image for this address
    URL = pre + ln + suf
    print("URL FOR STREETVIEW IMAGE:\n" + URL)
    # the filename needed to save each address's streetview image locally
    filename = os.path.join(dir, ln + ".jpg")
    print("OUTPUT FILENAME:\n" + filename)
    # final step: fetch and save the streetview image, sending the headers defined above
    response = requests.get(URL, headers=headers)
    with open(filename, "wb") as f:
        f.write(response.content)
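For reference, here is a condensed, hedged sketch of the core fetch-and-save step (the function name, key, and paths are placeholders, not from the question). Percent-encoding the address avoids malformed URLs, and checking the status code surfaces the reason Google gives for a 403 in the response body:

import urllib.parse

import requests

def fetch_streetview(address, key, out_path):
    # build the request URL, percent-encoding the address
    params = urllib.parse.urlencode(
        {"size": "600x600", "location": address, "key": key})
    url = "https://maps.googleapis.com/maps/api/streetview?" + params
    response = requests.get(url)
    if response.status_code != 200:
        # Google explains a 403 in the response body
        print(response.status_code, response.text)
        return
    with open(out_path, "wb") as f:
        f.write(response.content)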

scrapy - download with correct extension

I have the following spider:
import os

import pandas as pd
import scrapy

class Downloader(scrapy.Spider):
    name = "sor_spider"
    download_folder = FOLDER

    def get_links(self):
        df = pd.read_excel(LIST)
        # return the Series itself; a bare ".loc" is not iterable
        return df["Value"]

    def start_requests(self):
        urls = self.get_links()
        for url in urls.iteritems():
            index = {"index": url[0]}
            yield scrapy.Request(url=url[1], callback=self.download_file,
                                 errback=self.errback_httpbin, meta=index,
                                 dont_filter=True)

    def download_file(self, response):
        url = response.url
        index = response.meta["index"]
        content_type = response.headers['Content-Type']
        download_path = os.path.join(self.download_folder, r"{}".format(str(index)))
        with open(download_path, "wb") as f:
            f.write(response.body)
        yield LinkCheckerItem(index=response.meta["index"], url=url, code="downloaded")

    def errback_httpbin(self, failure):
        yield LinkCheckerItem(index=failure.request.meta["index"],
                              url=failure.request.url, code="error")
It should:
read the excel file with links (LIST)
go to each link and download the file to FOLDER
log results in LinkCheckerItem (I am exporting it to csv)
That would normally work fine, but my list contains files of different types: zip, pdf, doc etc.
These are examples of links in my LIST:
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=2c5fb68702294531afd03041e877ca84
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1173293
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1263289
https://disclosure.1prime.ru/Portal/GetDocument.aspx?emId=7805019624&docId=eb9f06d2b837401eba9c66c8bf5be813
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=952317
http://e-disclosure.ru/portal/FileLoad.ashx?Fileid=1042224
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1160005
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=925955
https://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166563
http://npoimpuls.ru/templates/npoimpuls/material/documents/%D0%A1%D0%BF%D0%B8%D1%81%D0%BE%D0%BA%20%D0%B0%D1%84%D1%84%D0%B8%D0%BB%D0%B8%D1%80%D0%BE%D0%B2%D0%B0%D0%BD%D0%BD%D1%8B%D1%85%20%D0%BB%D0%B8%D1%86%20%D0%BD%D0%B0%2030.06.2016.pdf
http://нпоимпульс.рф/assets/download/sal30.09.2017.pdf
http://www.e-disclosure.ru/portal/FileLoad.ashx?Fileid=1166287
I would like it to save each file with its original extension, whatever it is - just like my browser does when it opens a dialog to save the file.
I tried to use response.headers["Content-Type"] to find out the type, but in this case it is always application/octet-stream.
How could I do it?
You need to parse the Content-Disposition header for the correct file name.
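For example, here is a hedged sketch of how download_file could derive the name from that header (the regex is simplified, and the fallback naming is an assumption, not part of the original answer):

import os
import re

def download_file(self, response):
    index = response.meta["index"]
    # e.g. b'attachment; filename="report.pdf"'
    disposition = response.headers.get("Content-Disposition", b"").decode("utf-8", "ignore")
    match = re.search(r'filename="?([^";]+)"?', disposition)
    # fall back to the index alone if the header is missing
    file_name = match.group(1) if match else str(index)
    download_path = os.path.join(self.download_folder, "{}_{}".format(index, file_name))
    with open(download_path, "wb") as f:
        f.write(response.body)
    yield LinkCheckerItem(index=index, url=response.url, code="downloaded")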

Scrapy: renaming a downloaded file to a string with unicode causes messy characters

I'm trying to scrape some PDFs from a website. Instead of letting Scrapy name these files, I want to name the PDFs with the titles I scraped from the website, so I define ReportsPDFPipeline and override the file_path function.
class ReportsPDFPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # print("我被调用了")  ("I was called")
        file_guid = request.meta["title"]
        return "full/%s" % file_guid
The problem is that there are some unicode (Chinese) characters in the title, so no PDF files were stored at this path.
Then I tried a simple case:
class ReportsPDFPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # print("我被调用了")  ("I was called")
        return u"full/" + u"我被调用了" + u".PDF"
This time the file could be renamed and stored, but the file name comes out as messy, garbled characters.
What am I supposed to do to rename the files correctly?
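One hedged sketch, not from the original thread: assuming the title arrives as UTF-8 bytes or text, decode it explicitly and strip characters that are illegal in file names before returning the path, so the pipeline never guesses the encoding:

import re

from scrapy.pipelines.files import FilesPipeline

class ReportsPDFPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        title = request.meta["title"]
        if isinstance(title, bytes):
            # decode explicitly rather than relying on a platform default
            title = title.decode("utf-8")
        # replace characters that are not allowed in file names
        title = re.sub(r'[\\/:*?"<>|]', "_", title)
        return u"full/%s.pdf" % title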

Response as a CSV in Python

I have the following CherryPy function. The function is supposed to respond with a CSV file. However, it doesn't work, and I get the following response: Something went wrong here List(UnacceptedResponseContentTypeRejection(WrappedArray(text/csv))).
@cherrypy.expose
def myFunction(self, id, a):
    url = "clusters/downAsCSV?id=" + id + "&a=" + a + "&csv=true"
    htmlText = self.general_url(url)
    cherrypy.response.headers['Content-Type'] = 'text/csv'
    return htmlText
When I paste the same URL into the browser it does work, and the CSV is downloaded to the client.
What could be the reason?
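One hedged guess: the rejection text comes from the backend service (the WrappedArray in the error suggests a Scala/Akka backend), which rejects requests whose Accept header does not allow text/csv. If general_url wraps a plain HTTP call, you could try sending that header explicitly; base_url and the use of requests below are assumptions, not the question's actual helper:

import cherrypy
import requests

@cherrypy.expose
def myFunction(self, id, a):
    url = "clusters/downAsCSV?id=" + id + "&a=" + a + "&csv=true"
    # hypothetical: call the backend directly and ask for CSV explicitly,
    # so it does not reject the response content type
    response = requests.get(self.base_url + url, headers={"Accept": "text/csv"})
    cherrypy.response.headers['Content-Type'] = 'text/csv'
    return response.text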
