Take uploaded files on plone and download them via a python script? - python

I created a document site on plone from which file uploads can be made. I saw that plone saves them in the filesystem in the form of a blob, now I need to take them through a python script that will process the pdfs downloaded with an OCR. Does anyone have any idea how to do it? Thank you

Not sure how to extract PDFs from BLOB-storage or if it's possible at all, but you can extract them from a running Plone-site (e.g. executing the script via a browser-view):
import os
from Products.CMFCore.utils import getToolByName
def isPdf(search_result):
"""Check mime_type for Plone >= 5.1, otherwise check file-extension."""
if mimeTypeIsPdf(search_result) or search_result.id.endswith('.pdf'):
return True
return False
def mimeTypeIsPdf(search_result):
"""
Plone-5.1 introduced the mime_type-attribute on files.
Try to get it, if it doesn't exist, fail silently.
Return True if mime_type exists and is PDF, otherwise False.
"""
try:
mime_type = search_result.mime_type
if mime_type == 'application/pdf':
return True
except:
pass
return False
def exportPdfFiles(context, export_path):
"""
Get all PDF-files of site and write them to export_path on the filessytem.
Remain folder-structure of site.
"""
catalog = getToolByName(context, 'portal_catalog')
search_results = catalog(portal_type='File', Language='all')
for search_result in search_results:
# For each PDF-file:
if isPdf(search_result):
file_path = export_path + search_result.getPath()
file_content = search_result.getObject().data
parent_path = '/'.join(file_path.split('/')[:-1])
# Create missing directories on the fly:
if not os.path.exists(parent_path):
os.makedirs(parent_path)
# Write PDF:
with open(file_path, 'w') as fil:
fil.write(file_content)
print 'Wrote ' + file_path
print 'Finished exporting PDF-files to ' + export_path
The example keeps the folder-structure of the Plone-site in the export-directory. If you want them flat in one directory, a handler for duplicate file-names is needed.

Related

Python - pygithub If file exists then update else create

Im using PyGithub library to update files. Im facing an issue with that. Generally, we have to options. We can create a new file or if the file exists then we can update.
Doc Ref: https://pygithub.readthedocs.io/en/latest/examples/Repository.html#create-a-new-file-in-the-repository
But the problem is, I want to create a new file it if doesn't exist. If exist the use update option.
Is this possible with PyGithub?
I followed The Otterlord's suggestion and I have achieved this. Sharing my code here, it maybe helpful to someone.
from github import Github
g = Github("username", "password")
repo = g.get_user().get_repo(GITHUB_REPO)
all_files = []
contents = repo.get_contents("")
while contents:
file_content = contents.pop(0)
if file_content.type == "dir":
contents.extend(repo.get_contents(file_content.path))
else:
file = file_content
all_files.append(str(file).replace('ContentFile(path="','').replace('")',''))
with open('/tmp/file.txt', 'r') as file:
content = file.read()
# Upload to github
git_prefix = 'folder1/'
git_file = git_prefix + 'file.txt'
if git_file in all_files:
contents = repo.get_contents(git_file)
repo.update_file(contents.path, "committing files", content, contents.sha, branch="master")
print(git_file + ' UPDATED')
else:
repo.create_file(git_file, "committing files", content, branch="master")
print(git_file + ' CREATED')
If you are interested in knowing if a single path exists in your repo you can use something like this and then branch your logic out from the result.
from github.Repository import Repository
def does_object_exists_in_branch(repo: Repository, branch: str, object_path: str) -> bool:
try:
repo.get_contents(object_path, branch)
return True
except github.UnknownObjectException:
return False
But the method in the approved answer is more efficient if you want to check if multiple files exist in a given path, but I wanted to provide this as an alternative.

Django/Dropbox uploaded file is 0 bytes

I CAN'T figure out why my uploaded file to Dropbox via Django is always zero bytes.
Outsize Django (using raw Python), the files get uploaded normally.
My HTML has enctype="multipart/form-data" declared.
VIEWS.PY:
if cv_form.is_valid():
cv_form.save(commit=True)
# print(request.FILES['cv'].size) == 125898
DOC = request.FILES['cv']
PATH = f'/CV/{DOC}'
upload_handler(DOC, PATH)
#http_response_happens_here()
HANDLER:
def upload_handler(DOC, PATH):
dbx = dropbox.Dropbox(settings.DROPBOX_APP_ACCESS_TOKEN)
dbx.files_upload(DOC.file.read(), PATH)
#file gets uploaded but always 0bytes in size
My bad! How did I not see it.
The solution incase anyone else runs into the same issue. Battled it for a while myself.
if cv_form.is_valid():
cv_form.save(commit=False) # commit should be False before upload
# print(request.FILES['cv'].size) == 125898
DOC = request.FILES['cv']
PATH = f'/CV/{DOC}'
upload_handler(DOC, PATH)
cv_form.save(commit=True) # now you can save the file.
#http_response_happens_here()

sublime text 3 auto-complete plugin not working

I try to write a plugin to get all the classes in current folder to do an auto-complete injection.
the following code is in my python file:
class FolderPathAutoComplete(sublime_plugin.EventListener):
def on_query_completions(self, view, prefix, locations):
folders = view.window().folders()
results = get_path_classes(folders)
all_text = ""
for result in results:
all_text += result + "\n"
#sublime.error_message(all_text)
return results
def get_path_classes(folders):
classesList = []
for folder in folders:
for root, dirs, files in os.walk(folder):
for filename in files:
filepath = root +"/"+filename
if filepath.endswith(".java"):
filepath = filepath.replace(".java","")
filepath = filepath[filepath.rfind("/"):]
filepath = filepath[1:]
classesList.append(filepath)
return classesList
but somehow when I work in a folder dir with a class named "LandingController.java" and I try to get the result, the auto complete is not working at all.
However, as you may noticed I did a error_message output of all the contents I got, there are actual a list of class name found.
Can anyone help me solve this? thank you!
It turn outs that the actual format which sublime text accept is: [(word,word),...]
but thanks to MattDMo who point out the documentation since the official documentation says nothing about the auto complete part.
For better understanding of the auto complete injection api, you could follow Zinggi's plugin DictionaryAutoComplete and this is the github link
So for a standard solution:
class FolderPathAutoComplete(sublime_plugin.EventListener):
def on_query_completions(self, view, prefix, locations):
suggestlist = self.get_autocomplete_list(prefix)
return suggestlist
def get_autocomplete_list(self, word):
global classesList
autocomplete_list = []
uniqueautocomplete = set()
# filter relevant items:
for w in classesList:
try:
if word.lower() in w.lower():
actual_class = parse_class_path_only_name(w)
if actual_class not in uniqueautocomplete:
uniqueautocomplete.add(actual_class)
autocomplete_list.append((actual_class, actual_class))
except UnicodeDecodeError:
print(actual_class)
# autocomplete_list.append((w, w))
continue
return autocomplete_list

most pythonic way to retrieve images from within a module (in HTML)

I am attempting to write a program that sends a url request to a site which then produces an animation of weather radar. I then scrape that page to get the image urls (they're stored in a Java module) and download them to a local folder. I do this iteratively over many radar stations and for two radar products. So far I have written the code to send the request, parse the html, and list the image urls. What I can't seem to do is rename and save the images locally. Beyond that, I want to make this as streamlined as possible- which is probably NOT what I've got so far. Any help 1) getting the images to download to a local folder and 2) pointing me to a more pythonic way of doing this would be great.
# import modules
import urllib2
import re
from bs4 import BeautifulSoup
##test variables##
stationName = "KBYX"
prod = ("bref1","vel1") # a tupel of both ref and vel
bkgr = "black"
duration = "1"
#home_dir = "/path/to/home/directory/folderForImages"
##program##
# This program needs to do the following:
# read the folder structure from home directory to get radar names
#left off here
list_of_folders = os.listdir(home_dir)
for each_folder in list_of_folders:
if each_folder.startswith('k'):
print each_folder
# here each folder that starts with a "k" represents a radar station, and within each folder are two other folders bref1 and vel1, the two products. I want the program to read the folders to decide which radar to retrieve the data for... so if I decide to add radars, all I have to do is add the folders to the directory tree.
# first request will be for prod[0] - base reflectivity
# second request will be for prod[1] - base velocity
# sample path:
# http://weather.rap.ucar.edu/radar/displayRad.php?icao=KMPX&prod=bref1&bkgr=black&duration=1
#base part of the path
base = "http://weather.rap.ucar.edu/radar/displayRad.php?"
#additional parameters
call = base+"icao="+stationName+"&prod="+prod[0]+"&bkgr="+bkgr+"&duration="+duration
#read in the webpage
urlContent = urllib2.urlopen(call).read()
webpage=urllib2.urlopen(call)
#parse the webpage with BeautifulSoup
soup = BeautifulSoup(urlContent)
#print (soup.prettify()) # if you want to take a look at the parsed structure
tag = soup.param.param.param.param.param.param.param #find the tag that holds all the filenames (which are nested in the PARAM tag, and
# located in the "value" parameter for PARAM name="filename")
files_in=str(tag['value'])
files = files_in.split(',') # they're in a single element, so split them by comma
directory = home_dir+"/"+stationName+"/"+prod[1]+"/"
counter = 0
for file in files: # now we should THEORETICALLY be able to iterate over them to download them... here I just print them
print file
To save images locally, something like
import os
IMAGES_OUTDIR = '/path/to/image/output/directory'
for file_url in files:
image_content = urllib2.urlopen(file_url).read()
image_outfile = os.path.join(IMAGES_OUTDIR, os.path.basename(file_url))
with open(image_outfile, 'wb') as wfh:
wfh.write(image_content)
If you want to rename them, use the name you want instead of os.path.basename(file_url).
I use these three methods for downloading from the internet:
from os import path, mkdir
from urllib import urlretrieve
def checkPath(destPath):
# Add final backslash if missing
if destPath != None and len(destPath) and destPath[-1] != '/':
destPath += '/'
if destPath != '' and not path.exists(destPath):
mkdir(destPath)
return destPath
def saveResource(data, fileName, destPath=''):
'''Saves data to file in binary write mode'''
destPath = checkPath(destPath)
with open(destPath + fileName, 'wb') as fOut:
fOut.write(data)
def downloadResource(url, fileName=None, destPath=''):
'''Saves the content at url in folder destPath as fileName'''
# Default filename
if fileName == None:
fileName = path.basename(url)
destPath = checkPath(destPath)
try:
urlretrieve(url, destPath + fileName)
except Exception as inst:
print 'Error retrieving', url
print type(inst) # the exception instance
print inst.args # arguments stored in .args
print inst
There are a bunch of examples here to download images from various sites

Create zip archive for instant download

In a web app I am working on, the user can create a zip archive of a folder full of files. Here here's the code:
files = torrent[0].files
zipfile = z.ZipFile(zipname, 'w')
output = ""
for f in files:
zipfile.write(settings.PYRAT_TRANSMISSION_DOWNLOAD_DIR + "/" + f.name, f.name)
downloadurl = settings.PYRAT_DOWNLOAD_BASE_URL + "/" + settings.PYRAT_ARCHIVE_DIR + "/" + filename
output = "Download " + torrent_name + ""
return HttpResponse(output)
But this has the nasty side effect of a long wait (10+ seconds) while the zip archive is being downloaded. Is it possible to skip this? Instead of saving the archive to a file, is it possible to send it straight to the user?
I do beleive that torrentflux provides this excat feature I am talking about. Being able to zip GBs of data and download it within a second.
Check this Serving dynamically generated ZIP archives in Django
As mandrake says, constructor of HttpResponse accepts iterable objects.
Luckily, ZIP format is such that archive can be created in single pass, central directory record is located at the very end of file:
(Picture from Wikipedia)
And luckily, zipfile indeed doesn't do any seeks as long as you only add files.
Here is the code I came up with. Some notes:
I'm using this code for zipping up a bunch of JPEG pictures. There is no point compressing them, I'm using ZIP only as container.
Memory usage is O(size_of_largest_file) not O(size_of_archive). And this is good enough for me: many relatively small files that add up to potentially huge archive
This code doesn't set Content-Length header, so user doesn't get nice progress indication. It should be possible to calculate this in advance if sizes of all files are known.
Serving the ZIP straight to user like this means that resume on downloads won't work.
So, here goes:
import zipfile
class ZipBuffer(object):
""" A file-like object for zipfile.ZipFile to write into. """
def __init__(self):
self.data = []
self.pos = 0
def write(self, data):
self.data.append(data)
self.pos += len(data)
def tell(self):
# zipfile calls this so we need it
return self.pos
def flush(self):
# zipfile calls this so we need it
pass
def get_and_clear(self):
result = self.data
self.data = []
return result
def generate_zipped_stream():
sink = ZipBuffer()
archive = zipfile.ZipFile(sink, "w")
for filename in ["file1.txt", "file2.txt"]:
archive.writestr(filename, "contents of file here")
for chunk in sink.get_and_clear():
yield chunk
archive.close()
# close() generates some more data, so we yield that too
for chunk in sink.get_and_clear():
yield chunk
def my_django_view(request):
response = HttpResponse(generate_zipped_stream(), mimetype="application/zip")
response['Content-Disposition'] = 'attachment; filename=archive.zip'
return response
Here's a simple Django view function which zips up (as an example) any readable files in /tmp and returns the zip file.
from django.http import HttpResponse
import zipfile
import os
from cStringIO import StringIO # caveats for Python 3.0 apply
def somezip(request):
file = StringIO()
zf = zipfile.ZipFile(file, mode='w', compression=zipfile.ZIP_DEFLATED)
for fn in os.listdir("/tmp"):
path = os.path.join("/tmp", fn)
if os.path.isfile(path):
try:
zf.write(path)
except IOError:
pass
zf.close()
response = HttpResponse(file.getvalue(), mimetype="application/zip")
response['Content-Disposition'] = 'attachment; filename=yourfiles.zip'
return response
Of course this approach will only work if the zip files will conveniently fit into memory - if not, you'll have to use a disk file (which you're trying to avoid). In that case, you just replace the file = StringIO() with file = open('/path/to/yourfiles.zip', 'wb') and replace the file.getvalue() with code to read the contents of the disk file.
Does the zip library you are using allow for output to a stream. You could stream directly to the user instead of temporarily writing to a zip file THEN streaming to the user.
It is possible to pass an iterator to the constructor of a HttpResponse (see docs). That would allow you to create a custom iterator that generates data as it is being requested. However I don't think that will work with a zip (you would have to send partial zip as it is being created).
The proper way, I think, would be to create the files offline, in a separate process. The user could then monitor the progress and then download the file when its ready (possibly by using the iterator method described above). This would be similar what sites like youtube use when you upload a file and wait for it to be processed.

Categories