How to prevent upload of "Bad images" in Django? (PILLOW, easy-thumbnails) - python

Intro
I have been using Django with easy-thumbnails for quite a while, and today I stumbled upon a really nasty bug. Because I have always let easy-thumbnails do as it pleases, I consider myself a noob at this.
TL;DR
I need an efficient way to validate that an image can be read by easy-thumbnails or Pillow before it is saved to the Django model.
Explanation
When I tried to convert an .svg image, Pillow apparently crashed on the spot. I don't know if it is because of the format; according to some other Stack posts, installing libz or zlib1g should avoid the problem (both are already installed on my system, by the way).
But the format doesn't matter: if I insert a corrupt file as an image, it could make the library crash every time.
I need to be able to validate (inside my Django forms or my DRF serializers) that an image can be read by Pillow before saving it, and prevent this from ever happening again.
Any ideas to validate the file in an optimal way would be greatly appreciated.
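One possible way to do the check asked about above, as a minimal sketch rather than a definitive answer: let Pillow try to open and decode the uploaded bytes, and treat any failure as "not a usable image". The helper name is_readable_image is hypothetical, and the sketch assumes the upload has already been read into memory as bytes.

```python
import io

from PIL import Image


def is_readable_image(data):
    """Return True if Pillow can both identify and fully decode `data` (bytes)."""
    try:
        # verify() checks file integrity without decoding the pixel data.
        with Image.open(io.BytesIO(data)) as img:
            img.verify()
        # An image cannot be used after verify(), so reopen it and force a
        # full decode, which also catches truncated files.
        with Image.open(io.BytesIO(data)) as img:
            img.load()
    except Exception:
        # Pillow raises several exception types for bad input
        # (unidentified format, truncated data, ...), so catch broadly.
        return False
    return True
```

In a Django form or a DRF serializer, a field-level validation hook could read the uploaded file, call this helper, raise a validation error when it returns False, and rewind the file with seek(0) afterwards. Since Pillow has no SVG support, an .svg upload would be rejected by this check as well.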

Related

Is pytesseract safe to use with confidential images?

I am working on a project for my company which tries to read scanned pdfs and classify them depending on their contents.
After doing some research online, the easiest way to solve this seems to be by using a Python Library called pytesseract.
My question is: Is this library safe to use with images containing confidential customer data? Do the images/the extracted text get saved in some server?
I found this link which suggests that it is. But I lack an understanding of what exactly happens 'behind the scenes' every time I read an image with the module.
Thanks in advance for any help!

How to read image files from a Camera (Nikon D5600) with Python

Forgive me if I've left anything out or goofed up formatting conventions; this is my first time posting on this sort of forum.
So I've got a Nikon D5600 that I'm using as part of an (extremely basic) image analysis setup. I'd like to be able to use images from it without having to manually transfer the files over each time I run a test, but I've had some trouble getting access to the files.
To be clear, I don't want to capture screenshots of a video; I understand that this is possible, but the resolution is about 1/3 smaller in video, which is a bit of an issue for my application.
So, when I was 6 hours more naive, I plugged in the camera via USB to my (Windows 10) desktop, tried calling the image using the exact (well, I did change the slashes out) file path windows gave me in the properties screen:
img = cv2.imread("This PC/D5600/Removable storage/DCIM/314D5600/CFW_0031.jpg")
That didn't work.
I checked that the command I was using wasn't the issue by copying the picture to another drive:
img = cv2.imread("D:/CFW_0031.jpg")
That worked.
So I think, and think is a bold claim here, that it's something to do with the "This PC" bit of the path. I've read some old (circa 2009) posts about MTP and such things, but I'm honestly not sure if that's even what this camera uses, or how to get started with that if it is in fact the correct protocol.
I've also tried using pygrabber (I believe it's a wrapper of direct show, though my terminology may be wrong) to control the camera via python, but that also didn't work, although I did manage to control my webcam, which was interesting.
Finally, I attempted to assign a drive letter to the camera, but found that the camera wasn't in the manager's list of disks. It's entirely possible I just did this wrong, but I don't quite see how.
Edit regarding comment from Cristoph
-I just need to be able to use the image files in python, probably with opencv. I suppose that counts as reading them?
-I've attached a screenshot of what the "This PC" location looks like in the file explorer. The camera shows up under devices and drives, but doesn't have a drive letter.

Extremely new user to Python. "No module named request" error while trying code to detect image subdomains in a website to extract them to a folder

I may sound rather uninformed writing this, and unfortunately, my current issue may require a very articulate answer to fix. Therefore, I will try to be specific as possible as to ensure that my problem can be concisely understood.
My apologies for that, as this Python code was obtained from a friend of mine who wrote it for me in order to complete a certain task. I myself have extremely minimal programming knowledge.
Essentially, I am running Python 3.6 on a Mac. I am trying to work out a code that allows Python to scan through a bulk of a particular website's potentially existent subdomains in order to find possibly-existent JPG images files contained within said subdomains, and download any and all of the resulting found files to a distinct folder on my Desktop.
The Setup-
The code itself, named "download.py" on my computer, is written as follows:
import urllib.request
start = int(input("Start range:100000"))
stop = int(input("End range:199999"))
for i in range(start, stop + 1):
    filename = str(i).rjust(6, '0') + ".jpg"
    url = "http://website.com/Image_" + filename
    urllib.request.urlretrieve(url, filename)
    print(url)
(Note that the words "website" and "Image" have been substituted for the actual text included in my code).
Before I proceed, perhaps some explanation would be necessary.
Basically, the website in question contains several subdomains that include .JPG images, however, the majority of the exact URLs that allow the user to access these sub-domains are unknown and are a hidden component of the internal website itself. The format is "website.com/Image_xxxxxx.jpg", wherein x indicates a particular digit, and there are 6 total numerical digits by which only when combined to make a valid code pertain to each of the existent images on the site.
So as you can see, I have calibrated the code so that Python will initially search through number values in the aforementioned URL format from 100000 to 199999, and upon discovering any .JPG images attributed to any of the thousands of link combinations, will directly download all existent uncovered images to a specific folder that resides within my Desktop. The aim would be to start from that specific portion of number values, and upon running the code and fetching any images (or not), continually renumbering the code to work my way through all of the possible 6-digit combos until the operation is ultimately a success.
(Possible Side-Issue- Although I am fairly confident that my friend's code is written in a manner so that Python will only download .JPG files to my computer from images that actually do exist on that particular URL, rather than swarming my folder with blank/bare files from every single one of URL attempts regardless of whether that URL happens to be successful or not, I am admittedly not completely certain. If the latter is the case, informing me of a more suitable edit to my code would be tremendously appreciated.)
The Execution-
Right off the bat, the code experienced a large error. I'll list through the series of steps that led to the creation of said error.
#1- Of course, I first copy-pasted the code into a text document, and saved it as "download.py". I saved it inside of a folder named "Images" where I sought the images to be directly downloaded to. I used BBEdit.
#2- I proceeded, in Terminal, to input the commands "cd Desktop/Images" (to account for the file being held within the "Images" folder on my Desktop), followed by the command "Python download.py" (to actually run the code).
As you can see, the error which I obtained following my attempt to run the code was the ImportError: No module named request. Despite me guessing that the answer to solving this is simple, I can legitimately say I have got such minimal knowledge regarding Python that I've absolutely no idea how to solve this.
Hint: Prior to making the download.py file, the folder, and typing the Terminal code the only interactions I made with Python were downloading the program (3.6) and placing it in my toolbar. I'm not even quite sure if I am required to create any additional scripts/text files, or make any additional downloads before a script like this would work and successfully download the resulting images into my "Images" folder as is my desired goal. If I sincerely missed something integral at any point during this long read, hopefully, someone in here can provide a thoroughly detailed explanation as to how to solve my issue.
Finishing statements for those who've managed to stick along this far:
Thank you. I know this is one hell of a read, and I'm getting more tired as I go along. What I hope to get out of this question is
1.) Obviously, what would constitute a direct solution to the "No module named request" ImportError in Terminal. In other words, what I did wrong there or am missing.
2.) Any other helpful information that you know would assist this code, for example, if there is any integral step or condition I've missed or failed to meet that would ultimately cause the entirety of my code to cease to work. If you do see a fault in this, I only ask you to be specific, as I've not got much experience in the programming world. After all, I know there are a lot of developers out here who are far more informed and experienced than I am. Thanks.
urllib.request is in Python 3 only. When you run 'python' on a Mac, you're running Python 2 by default. Try running the script with python3. You can check which version you get with
python --version
and you might need to
brew install python3
urllib.request is a Python 3 construct. Most systems run Python 2 as default and this is what you get when you run simply python.
To install Python 3, go to https://brew.sh/ and follow the instructions to install the Homebrew package manager. Then run
brew install python3
python3 download.py
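On the side issue raised in the question: under Python 3, urlretrieve raises urllib.error.HTTPError when the server answers with an error status such as 404. The loop as written would therefore stop at the first missing number rather than fill the folder with blank files. A minimal sketch of a loop that skips missing images instead is shown below; the function name download_range and the retrieve parameter are hypothetical and are not part of the original script.

```python
import urllib.error
import urllib.request


def download_range(start, stop, base_url, retrieve=urllib.request.urlretrieve):
    """Try every 6-digit image name in [start, stop]; return the names saved."""
    saved = []
    for i in range(start, stop + 1):
        filename = str(i).rjust(6, "0") + ".jpg"
        url = base_url + filename
        try:
            retrieve(url, filename)
        except urllib.error.HTTPError:
            # The server answered with an error status (for example 404),
            # so there is no image at this number; move on to the next one.
            continue
        saved.append(filename)
    return saved
```

It could be called as download_range(100000, 199999, "http://website.com/Image_"), with the placeholder URL replaced by the real one. The retrieve parameter only exists so the loop can be tried out without touching the network.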

django/python: How to convert pptx/docx formats to PDF using python?

First of all, I agree that this might sound like a question which has already been asked many times in the past. However I couldn't find any answer that was relevant to me in the similar questions so I'll try to be more specific.
I would need to transform PPTX/DOCX files into PDF using Python but I don't have any experience in file format conversion. I have been looking in many places/forums/websites, read a lot of documentation and came across some useful libraries (python-pptx and pyPdf mainly), but I still don't know where to start.
When looking on the Internet, I can see many websites that offer file format conversions as a paid service, even with advanced APIs: submit a file via POST and get the transformed PDF file in return. This could work for me, but I am really interested in writing the code that does the conversion from OOXML to PDF myself.
How would you start doing this? Or is it just impossible on my own?
Thanks for your help!
After some research and with the help of python-pptx's creator, I was able to write to the PowerPoint COM interface using a Virtual Machine.
In case someone reads this thread, this is how I managed to get this done:
- Setup a VM with Microsoft Windows/Office installed on it ;
- Install Python, Django and win32com libraries on the VM.
The files are sent locally from the original Django project to the virtual machine (both are on the same network) through a simple POST request. The file is converted on the VM with a simple call to the win32com.client library and then sent back as the response to the original Django view, which in turn processes it.
Note: it took me some time to realize I needed to use the @csrf_exempt decorator for this setup to work.
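The answer does not show the conversion call itself. The following is a rough sketch of what such a call might look like, assuming pywin32 and Microsoft Office are installed on the Windows VM. The helper names pdf_path_for and convert_to_pdf are hypothetical; the numeric constants are the Office "save as PDF" format values (ppSaveAsPDF and wdFormatPDF).

```python
import os

# Office "save as PDF" format constants (ppSaveAsPDF and wdFormatPDF).
PP_SAVE_AS_PDF = 32
WD_FORMAT_PDF = 17


def pdf_path_for(src_path):
    """Return the source path with its extension replaced by .pdf."""
    root, _ext = os.path.splitext(src_path)
    return root + ".pdf"


def convert_to_pdf(src_path):
    """Convert a .pptx or .docx file to PDF through Office's COM interface."""
    ext = os.path.splitext(src_path)[1].lower()
    if ext not in (".pptx", ".docx"):
        raise ValueError("unsupported extension: " + ext)

    # Imported here so the module still loads on machines without pywin32.
    import win32com.client

    src = os.path.abspath(src_path)  # COM expects absolute paths
    dst = pdf_path_for(src)
    if ext == ".pptx":
        app = win32com.client.Dispatch("PowerPoint.Application")
        try:
            deck = app.Presentations.Open(src, WithWindow=False)
            deck.SaveAs(dst, PP_SAVE_AS_PDF)
            deck.Close()
        finally:
            app.Quit()
    else:
        app = win32com.client.Dispatch("Word.Application")
        try:
            doc = app.Documents.Open(src)
            doc.SaveAs(dst, FileFormat=WD_FORMAT_PDF)
            doc.Close()
        finally:
            app.Quit()
    return dst
```

On the VM, the Django view that receives the POST could write the upload to disk, call convert_to_pdf on it, and return the resulting file in the response.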

Any good pdf417 Barcode libraries for Python?

I'm looking for a good python module to generate pdf417 barcodes. Has anyone used one they liked?
Ideally I would like one with as few dependencies as possible, and one that runs on both linux and MacOSX.
We recently had to approach this problem as well, and being a Python shop we wanted a Python solution. It became clear that elaphe is the project that had the potential to actually generate pdf417 barcodes.
However, what we found was that it errors by today's standards, and so we set out to fix the library. It turns out elaphe generates an outdated form of *.eps PostScript that can't be interpreted by Ghostscript, and this is where the barcode generation fails.
Fortunately, elaphe uses a common library behind the scenes called Barcode Writer in Pure PostScript: http://bwipp.terryburton.co.uk
This common backend library is used by many projects in multiple languages to generate barcodes. The fix specifically for us was to fork elaphe and correct its *.eps file generation.
To determine what is broken in the *.eps, look at this other site that is made using postscriptbarcode; it lets you generate the pdf417 barcode online (as well as other formats): http://www.terryburton.co.uk/barcodewriter/generator/
Once you generate a pdf417 barcode it gives you the option to download the .png, .jpg, and YES the .eps file!
Using this .eps file you can pipe it to ghost script and tweak the parameterization to get the exact pdf417 barcode you are looking for. Then take this result and integrate it into the elaphe library and actually get a pull request on that thing ....
Seems to be a bit of work, but nothing that can't be knocked out in an afternoon. Ideally the elaphe library would be back in shape to generate these barcodes without needing this fix.
Please note that with this approach it takes us a few seconds to generate a barcode, because it creates the 2000-line eps file and pipes it to Ghostscript, which generates another image file that we send back as the final barcode result. This is not as performant as code128 with reportlab.
Perhaps there is room for optimization: Is Pillow faster than PIL in any way? Do we need all the parts of the eps file to generate a barcode of type pdf417? Are there other ways to optimize?
Anyway, great question Ken and I hope you find this to be a great answer.
I guess the issue in elaphe reported by Matteius in 2013 has been fixed, since the issues and commit logs show updates on the pdf417 topic since then.
Anyway, there are now a few other options (got the list with either pip search elaphe or pip search pdf417) :
elaphe ;
elaphe3 (fork of elaphe tested against python3) ;
candybar (no documentation ? also a webservice) ;
pdf417gen ;
treepoem (about the name : barcode -> bark ode -> tree poem =D ) — edit : didn't dig the issue, but as of today generation of PDF417 seems broken.
All but pdf417gen support several types of barcodes.
Note that the documentation of bwipp (on which are based elaphe and treepoem) only mentions 5 levels of error correction (1 to 5), while pdf417gen claims to support 9 security levels (0 to 8).
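For the pdf417gen option listed above, a minimal sketch of generating a barcode could look like the following. It assumes the pdf417gen package is installed and uses its encode and render_image functions. The wrapper name make_pdf417 is hypothetical, and the range check simply mirrors the 0 to 8 security levels mentioned above.

```python
def make_pdf417(text, columns=6, security_level=2):
    """Render `text` as a PDF417 barcode and return a Pillow image."""
    if not 0 <= security_level <= 8:
        raise ValueError("security_level must be between 0 and 8")

    # Imported here so the range check above works without the package.
    from pdf417gen import encode, render_image

    # encode() produces the barcode code words; render_image() draws them.
    codes = encode(text, columns=columns, security_level=security_level)
    return render_image(codes)
```

The returned object is a Pillow image, so something like make_pdf417("Hello").save("barcode.png") would write it to disk.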
Reportlab does have an extension called rlbarcode, but this one does not include support for pdf417 codes. I do not know of any other extension for reportlab including support for pdf417 bar codes.
Anyway, if you are interested in generation of pdf417 codes from python, you may be interested in this project: elaphe.
I have not tested it yet (in fact, I need to generate pdf417 from Python, and I found this thread as well as the elaphe project page). I am going to download the elaphe tools in order to test them right now.
