I am trying to write code that converts a PDF residing on the web into a series of JPGs.
I have working code that:
1) takes a PDF
2) saves it to disk
3) converts it to JPGs, which are saved to disk.
Is there a way to write the same code (my attempt, which throws an error, is below) so that it takes the PDF from the internet but keeps it in memory (to avoid writing to and reading from disk), then converts it to JPGs (which are to be uploaded to AWS S3)?
I was thinking this would work:
f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf") #file to process
But i get the following error:
"Exception TypeError: TypeError("object of type 'NoneType' has no len()",) in > ignored"
Full code follows, along with the PDF file that I want converted. Note: the code works if I replace f = with the location of a PDF saved on disk:
from urllib2 import urlopen
from wand.image import Image

# location on disk
save_location = "/home/bob/Desktop/pdfs to convert/example1"
# file prefix
test_id = 'example'

print 1
f = urlopen("https://s3.us-east-2.amazonaws.com/converted1jpgs/example.pdf")
print 2
print type(f)

with Image(filename=f) as img:
    print('pages = ', len(img.sequence))
    with img.convert('jpg') as converted:
        converted.save(filename=save_location + "/" + test_id + ".jpg")
The result of urlopen obviously isn't a filename, so you can't pass in filename=f and expect it to work.
I don't have Wand installed, but from the docs, there are clearly a bunch of alternative ways to construct it.
First, the object returned by urlopen is a file-like object. Of course "file-like object" is a somewhat vague term, and not all file-like objects work for all APIs that expect file-like objects (e.g., the API may expect to be able to call fileno and read from it at the POSIX level…), but this is at least worth trying (note file instead of filename):
with Image(file=f) as img:
If that doesn't work, you can always read the data into memory:
buf = f.read()
with Image(blob=buf) as img:
Not as ideal (if you have giant files), but at least you don't have to store it on disk.
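For completeness, here is a minimal, standard-library-only sketch of the download-into-memory half (in Python 3, urlopen lives in urllib.request; a data: URL containing a few PDF header bytes stands in for the real S3 URL so the snippet runs offline, and the wand calls from above are shown only as comments):

```python
from io import BytesIO
from urllib.request import urlopen

# A data: URL holding the bytes b"%PDF-1.4" stands in for the real
# "https://..." PDF location, so this example needs no network access.
url = "data:application/pdf;base64,JVBERi0xLjQ="

response = urlopen(url)      # file-like object; nothing touches the disk
pdf_bytes = response.read()  # the whole PDF, held in memory as bytes

# With wand you would then use either the file-like form ...
#     with Image(file=BytesIO(pdf_bytes), resolution=150) as img: ...
# ... or the blob form:
#     with Image(blob=pdf_bytes, resolution=150) as img: ...
print(pdf_bytes[:4])
```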
I'm trying to read a PDF file extracted from a zip file in memory to get the tables inside the file. Camelot seems a good way to do it, but I'm getting the following error:
AttributeError: '_io.StringIO' object has no attribute 'lower'
Is there some way to read the file and extract the tables with camelot, or should I use another library?
z = zipfile.ZipFile(self.zip_file)
for file in z.namelist():
    if file.endswith(".pdf"):
        pdf = z.read(file).decode(encoding="latin-1")
        pdf = StringIO(pdf)
        pdf = camelot.read_pdf(pdf, codec='utf-8')
camelot.read_pdf(filepath, ...)
accepts a file path as its first parameter, so it appears to be a bad match for your requirements; consider searching for another library.
In any case, StringIO(pdf) returns a wrapper object, not the text itself:
<_io.StringIO object at 0x000002592DD33E20>
To get the contents back out of a StringIO, call its read() method:
pdf = StringIO(pdf)
pdf.read()
That will return the file's contents as a string. Next, think about what encoding the library will accept: a PDF is binary data, so it is generally better kept as raw bytes (BytesIO) than decoded to text.
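If you do stay with camelot, one workaround that matches its path-based API is to extract each PDF member of the zip to a named temporary file and pass that path along. Here is a standard-library-only sketch (the member name, contents, and the commented-out camelot call are illustrative):

```python
import io
import zipfile
from tempfile import NamedTemporaryFile

# Build a tiny in-memory zip to stand in for self.zip_file
# (the member name and contents are made up).
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as zf:
    zf.writestr("report.pdf", b"%PDF-1.4 fake contents")

pdf_paths = []
with zipfile.ZipFile(zip_buffer) as z:
    for name in z.namelist():
        if name.endswith(".pdf"):
            # Keep the raw bytes -- no .decode()! A PDF is binary data.
            with NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
                tmp.write(z.read(name))
                pdf_paths.append(tmp.name)

# tables = camelot.read_pdf(pdf_paths[0])  # hypothetical camelot call
print(pdf_paths)
```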
I'm importing an image using the following code:
files = {
    'file': open(r'C:/Users/jared/Deblur Project/curl requests/test.jpg', 'rb'),
}
response = requests.post('http://localhost:5000/net/image/evaluate_local', files=files)
print(response)
This sends 'test.jpg' over to the following route:
@app.post("/net/image/evaluate_local")
async def get_net_image_evaluate_local(file: UploadFile = File(...)):
    image_path = file.read()
    threshold = 0.75
    model_path = "model.tflite"
    prediction = analyze_images(model_path, image_path, threshold)
    return prediction
Obviously, image_path = file.read() is not working, but it's showcasing what I'm trying to do. I need to provide an image path to the analyze_images() function, but I'm not exactly sure how to do so.
If I cannot provide it as a path, I am also trying to provide it as raw bytes array for the model to use.
image_path = file.read()
returns
b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xe2\x02(ICC_PROFILE\x00\x01\x01\x00\x00\x02\x18\x00\x00\x00\x00\x02\x10\x00\x00mntrRGB XYZ \x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00acsp\x00\x00\x00\x00\x00\x00...
which I am also unsure how to work with.
Anyone have any advice on how to proceed?
I'm unsure how to make it work as a file path, but I did manage to get it to work. Luckily, the model I'm working with accepts those image bytes (as returned by file.read()).
So with image = file.read(), I could feed image into a function that took it as a parameter and use it as follows:
img = np.uint8(tf.image.resize(tf.io.decode_image(image), [width, height], method=tf.image.ResizeMethod.BILINEAR))
There is no immediate file path, because FastAPI's/Starlette's UploadFile uses a "spooled" file:
A file stored in memory up to a maximum size limit, and after passing this limit it will be stored in disk.
The underlying implementation is actually from Python's standard tempfile module for generating temporary files and folders. See the tempfile.SpooledTemporaryFile section:
This class operates exactly as TemporaryFile() does, except that data is spooled in memory until the file size exceeds max_size, or until the file’s fileno() method is called, at which point the contents are written to disk and operation proceeds as with TemporaryFile().
If you can work instead with the raw bytes of the image (as what you did using .read() in your other answer), then I think that's the better approach, as most image processing starts with the images bytes anyway. (Make sure to use await appropriately if you are calling the async methods!)
An alternative is, if a function expects a "file-like" object, then you can pass in the UploadFile.file object itself, which is the SpooledTemporaryFile object, which is the
... actual Python file that you can pass directly to other functions or libraries that expect a "file-like" object.
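The spooling behaviour is easy to observe with the standard library alone (the max_size below is an arbitrary small number chosen for demonstration):

```python
from tempfile import SpooledTemporaryFile

# Data smaller than max_size stays in memory; writing past max_size
# (or calling fileno()) makes it roll over to a real file on disk.
spooled = SpooledTemporaryFile(max_size=1024)
spooled.write(b"tiny payload")

spooled.seek(0)          # rewind, just like any other file object
data = spooled.read()
print(data)              # b'tiny payload'
```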
If you really need a file on disk and a path to it, you can write the contents to a NamedTemporaryFile, which
... is guaranteed to have a visible name in the file system (on Unix, the directory entry is not unlinked). That name can be retrieved from the name attribute of the returned file-like object.
...
The returned object is always a file-like object whose file attribute is the underlying true file object. This file-like object can be used in a with statement, just like a normal file.
from tempfile import NamedTemporaryFile

@app.post("/net/image/evaluate_local")
async def get_net_image_evaluate_local(file: UploadFile = File(...)):
    file_suffix = "".join(file.filename.partition(".")[1:])
    with NamedTemporaryFile(mode="w+b", suffix=file_suffix) as file_on_disk:
        file_contents = await file.read()
        file_on_disk.write(file_contents)
        image_path = file_on_disk.name
        print(image_path)
        threshold = 0.75
        model_path = "model.tflite"
        prediction = analyze_images(model_path, image_path, threshold)
        return prediction
On macOS, image_path prints out something like
/var/folders/3h/pdjwtnlx4p13chnw21rvwbtw0000gp/T/tmp1v52fm95.png
and that file would be available as long as the file is not yet closed.
I'm currently working on a simple proof of concept for a pdf-editor application. The example is supposed to be a simplified python script showcasing how we could use the pdfrw library to edit PDF files with forms in them.
So, here's the issue. I'm not interested in writing the edited PDF to a file.
The idea is that file opening and closing is going to most likely be handled by external code and so I want all the edits in my files to be done in memory. I don't want to write the edited filestream to a local file.
Let me specify what I mean by this. I currently have a piece of code like this:
class FormFiller:
    def __fill_pdf__(input_pdf_filestream: bytes, data_dict: dict):
        template_pdf: pdfrw.PdfReader = pdfrw.PdfReader(input_pdf_filestream)
        # <some editing magic here>
        return template_pdf

    def fillForm(self, mapper: FieldMapper):
        value_mapping: dict = mapper.getValues()
        filled_pdf: pdfrw.PdfReader = self.__fill_pdf__(self.filesteam, value_mapping)
        # <this point is crucial>

    def __init__(self, filestream: bytes):
        self.filesteam: bytes = filestream
So, as you see the FormFiller constructor receives an array of bytes. In fact, it's an io.BytesIO object. The template_pdf variable uses a PdfReader object from the pdfrw library. Now, when we get to the #<this point is crucial> marker, I have a filled_pdf variable which is a PdfReader object. I would like to convert it to a filestream (a bytes array, or an io.BytesIO object if you will), and return it in that form. I don't want to write it to a file. However, the writer class provided by pdfrw (pdfrw.PdfWriter) does not allow for such an operation. It only provides a write(<filename>) method, which saves the PdfReader object to a pdf output file.
How should I approach this? Do you recommend a workaround? Or perhaps I should use a completely different library to accomplish this?
Please help :-(
To save your altered PDF to memory in an object that can be passed around (instead of writing to a file), simply create an empty instance of io.BytesIO:
from io import BytesIO
new_bytes_object = BytesIO()
Then, use pdfrw's PdfWriter to write your data to the empty BytesIO object (note that you need a PdfWriter instance):
pdfrw.PdfWriter().write(new_bytes_object, filled_pdf)
# I'm not sure about the exact syntax; I haven't used this lib before
This works because io.BytesIO objects act like a file object, also known as a file-like object. It and related classes like io.StringIO behave like files in memory, such as the object f created with the built-in function open below:
with open("output.txt", "a") as f:
    f.write(some_data)
Before you attempt to read from new_bytes_object, don't forget to seek(0) back to the beginning (i.e., rewind it); otherwise the object will seem empty.
new_bytes_object.seek(0)
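Here is a self-contained illustration of that rewind step, with a plain BytesIO and stand-in bytes in place of PdfWriter's output:

```python
from io import BytesIO

new_bytes_object = BytesIO()
new_bytes_object.write(b"%PDF-1.4 stand-in for the written PDF")

print(new_bytes_object.read())  # b'' -- the position is at the end after writing

new_bytes_object.seek(0)        # rewind
print(new_bytes_object.read())  # now returns all of the written bytes
```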
I'm working with Python 3, Django and the xhtml2pdf package.
I want to create a PDF from an HTML string, but I don't want to write the PDF on disk, but rather just to get its bytes from memory, as in using BytesIO or StringIO.
I've read the xhtml2pdf documentation. This is the closest I've found related to what I need:
In-memory files can be generated by using StringIO or cStringIO instead of the file open. Advanced options will be discussed later in this document.
And this is one of the latest things I've tried:
def html_to_pdf(html):
    """Writes a PDF file using xhtml2pdf from a given HTML stream

    Parameters
    ----------
    html : str
        A valid HTML string.

    Returns
    -------
    bytes
        A bytes sequence containing the rendered PDF.
    """
    output = BytesIO()
    pisa_status = pisa.CreatePDF(html, dest=output)
    return new_output.read()
But this isn't working.
Any idea how to output the generated PDF as a in-memory object and thus return its bytes?
I think your return statement is using new_output instead of output.
However, the real issue should be something else, have you tried calling output.seek(0) before reading its bytes with output.read()?
What you can also do is output.getvalue(). This will get the entire contents of the BytesIO object.
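So a fixed version would end with return output.getvalue() (or output.seek(0) followed by output.read()). The difference is easy to see with a plain BytesIO standing in for the pisa output:

```python
from io import BytesIO

output = BytesIO()
output.write(b"rendered pdf bytes")  # stands in for pisa.CreatePDF(html, dest=output)

print(output.read())      # b'' -- read() starts from the current position (the end)
print(output.getvalue())  # b'rendered pdf bytes' -- independent of the position
```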
I am learning machine learning and data analysis on wav files.
I know if I have wav files directly I can do something like this to read in the data
import librosa
mono, fs = librosa.load('./small_data/time_series_audio.wav', sr = 44100)
Now I'm given a gz-file "music_feature_extraction_test.tar.gz"
I'm not sure what to do now.
I tried:
with gzip.open('music_train.tar.gz', 'rb') as f:
    for files in f:
        mono, fs = librosa.load(files, sr=44100)
but it gives me:
TypeError: lstat() argument 1 must be encoded string without null bytes, not str
Can anyone help me out?
There are several things going on:
The file you are given is a gzip-compressed tarball. Take a look at the tarfile module; it can read gzip-compressed files directly. You'll get an iterator over its members, each of which is an individual file.
As far as I can see, librosa can't read from an in-memory buffer, so you have to unpack the tar members to temporary files. The tempfile module is your friend here: a NamedTemporaryFile will provide you with a self-deleting file that you can uncompress to and hand to librosa.
You probably want to implement this as a simple generator function that takes the tarfile name as its input, iterates over its members, and yields what librosa.load() provides. That way everything gets cleaned up automatically.
The basic loop would therefore be
Open the tarball using the tarfile module. For each member:
Get a new temporary file using NamedTemporaryFile. Copy the content of the tarball member to that file; you may want to use shutil.copyfileobj to avoid reading the entire wav file into memory before writing it to disk.
The NamedTemporaryFile has a name attribute. Pass that to librosa.load.
yield the return value of librosa.load to the caller.
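The loop above could be sketched like this, using only the standard library; a tiny generated tarball with a fake .wav member stands in for music_train.tar.gz, and the librosa.load call is left as a comment since the audio data here is not a real wav file:

```python
import io
import shutil
import tarfile
from tempfile import NamedTemporaryFile

def wav_paths(tar_path):
    """Yield a temporary-file path for each .wav member of a gzipped tarball.

    Each temporary file is deleted automatically once the consumer advances
    the generator (note: reopening by name may not work on Windows).
    """
    with tarfile.open(tar_path, "r:gz") as tar:
        for member in tar:
            if not member.name.endswith(".wav"):
                continue
            source = tar.extractfile(member)
            with NamedTemporaryFile(suffix=".wav") as tmp:
                shutil.copyfileobj(source, tmp)  # stream; no full read into memory
                tmp.flush()
                yield tmp.name  # e.g. mono, fs = librosa.load(tmp.name, sr=44100)

# Build a small demo tarball with one fake .wav member (names are made up).
payload = b"RIFF....fake wav data"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo("clip1.wav")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

with NamedTemporaryFile(suffix=".tar.gz", delete=False) as f:
    f.write(buf.getvalue())
    demo_tar = f.name

contents = [open(p, "rb").read() for p in wav_paths(demo_tar)]
print(contents == [payload])
```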
You can use PySoundFile to read from the compressed file.
https://pysoundfile.readthedocs.io/en/0.9.0/#virtual-io
import soundfile

with gzip.open('music_train.tar.gz', 'rb') as gz_f:
    for file in gz_f:
        mono, fs = soundfile.read(file, samplerate=44100)
Maybe you should also check if you need to resample the data before processing it with librosa:
https://librosa.github.io/librosa/ioformats.html#read-specific-formats