How to create a PDF from a binary string? - python

A request has been made to the server using Python's requests module:
requests.get('myserver/pdf', headers)
It returned a status-200 response with the PDF binary data in response.content.
Question
How does one create a PDF file from the response.content?

You can open the target path in binary write mode and write the response bytes to it like this:
import requests
# Example path. This file does not need to exist yet; we will use
# this as the location and name of the PDF in question.
path_to_create_pdf_with_name_of_pdf = r'C:/User/Oleg/MyDownloadablePdf.pdf'
# Set up anything you used before making the request here (for example
# the headers). The question did not include that code.
headers = {}
# Pass headers by keyword: the second positional argument of requests.get is params.
request = requests.get('myserver/pdf', headers=headers)
# Open the file (it is created if it does not exist) and write the binary data to it.
with open(path_to_create_pdf_with_name_of_pdf, 'wb') as f:
    f.write(request.content)
No PDF library such as reportlab is needed for this. The response already contains a complete PDF, and open() in 'wb' (write binary) mode creates the file at the path you specify. A reportlab canvas.Canvas(path) object, by contrast, only writes a file once its save() method is called.
The with block closes the file when it ends, so there is no need to call f.close() yourself.
Make sure the path does not name an existing file: opening a file in 'wb' mode truncates it, so any existing content is lost. If you are doing this in a loop, for example, use a new name on each iteration so that every response gets its own PDF. For a one-off download there is no such risk, as long as the name is not already taken by another file.
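The overwrite concern can also be handled in code. Below is a minimal sketch, assuming the bytes have already been fetched (for example from response.content). The helper name save_pdf_bytes and the "-1", "-2" numbering scheme are illustrative choices, not part of requests, and the %PDF check is only a loose sanity test.

```python
from pathlib import Path


def save_pdf_bytes(content: bytes, directory: str, stem: str) -> Path:
    """Write PDF bytes to directory/stem.pdf without overwriting anything.

    If the name is taken, fall back to stem-1.pdf, stem-2.pdf, and so on.
    Hypothetical helper for illustration.
    """
    # Loose sanity check: PDF files normally begin with the %PDF marker.
    if not content.startswith(b"%PDF"):
        raise ValueError("content does not look like a PDF")
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    counter = 0
    while True:
        suffix = f"-{counter}" if counter else ""
        target = folder / f"{stem}{suffix}.pdf"
        try:
            # 'xb' = exclusive creation in binary mode: fails if the file exists.
            with open(target, "xb") as f:
                f.write(content)
            return target
        except FileExistsError:
            counter += 1
```

In a download loop this could be called as save_pdf_bytes(response.content, "downloads", "report"), so each response ends up in its own file.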

Related

How to trick camelot into reading from a byte-string rather than a real file

I was wondering if it was possible to read a pdf into camelot not by giving it the path of a file, but rather a binary string containing the PDF data.
The reason I want to do this is that I have PDFs inside a zip-File, and rather than extracting the content into a temporary directory I would like to pass the byte data directly into camelot.
So far I have tried the following:
from zipfile import ZipFile
from pathlib import Path
from io import BytesIO
import camelot
zipFileName = Path("file.zip") # containing the PDF file
pdf = ZipFile(zipFileName).read("path_to_zip.zip")
# pdf now contains the content of the PDF and starts like this:
# b'%PDF-1.4\n%\x80\x81\x82\...
f = BytesIO(pdf)
tables = camelot.read_pdf(f)
This leads to an error message:
File ~/opt/anaconda3/envs/brd-test/lib/python3.9/site-packages/camelot/handlers.py:41, in PDFHandler.__init__(self, filepath, pages, password)
39 filepath = download_url(filepath)
40 self.filepath = filepath
---> 41 if not filepath.lower().endswith(".pdf"):
42 raise NotImplementedError("File format not supported")
44 if password is None:
AttributeError: '_io.BytesIO' object has no attribute 'lower'
Obviously camelot really wants to operate on a file and checks if the extension matches ".pdf".
Any suggestions on how to trick camelot into accepting the content rather than the file path?
Someone wrote a PR to add this capability, but camelot-py no longer seems to be maintained, so it was never merged. You can pull and merge it locally to get the functionality: https://github.com/camelot-dev/camelot/pull/270
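If patching camelot locally is not an option, one workaround is to give it a real path after all and let the standard library manage the temporary file. This does write the bytes to disk, so it only partly meets the goal in the question; treat it as a sketch. The helper name pdf_bytes_as_path is hypothetical, and the camelot call is shown only as a comment because it assumes camelot is installed.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def pdf_bytes_as_path(data: bytes):
    """Yield the path of a temporary .pdf file holding data, then delete it."""
    # delete=False so the file can be reopened by name on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        tmp.write(data)
        tmp.close()
        yield tmp.name
    finally:
        tmp.close()  # harmless if already closed
        os.unlink(tmp.name)


# Usage with camelot (not run here):
# with pdf_bytes_as_path(pdf) as path:
#     tables = camelot.read_pdf(path)
```

The ".pdf" suffix matters because the handler in the traceback checks that the path ends with ".pdf".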

Why the close method is unknown when using map to close multiple files?

I have a use-case that resembles the following:
files = [open("foo1.pdf", "rb"), open("foo2.pdf", "rb"), open("foo3.pdf", "rb")]
# ... extract portions from the opened files using PyPDF2 and assemble a new PDF file
map(close, files)
Why do I do this? Because when using PyPDF2 to merge multiple input PDF files into another file, closing each input PDF early gives you empty pages in the output PDF. The input files have to stay open until the output PDF file is generated; see https://github.com/mstamy2/PyPDF2/issues/293
The map call results in the following error:
NameError: name 'close' is not defined
The following works but I'd like the less verbose code variation:
map(lambda file: file.close(), files)
I'd of course prefer the following instead:
map(close, files)
Because close is not a standalone function, unlike open. It is a method of a file-like object.
If you want to programmatically close files, call close on the objects themselves.
for x in files:
    x.close()
If you absolutely want to use map, you could use a lambda function, but I'd recommend against it: in Python 3, map returns a lazy iterator, so nothing is closed until the iterator is consumed, and it is unclear to the reader which files are closed and which are still open.
list(map(lambda x: x.close(), files))
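For the shorter spelling the question asks about, operator.methodcaller from the standard library builds a callable that invokes a named method on whatever it is given. The sketch below uses throwaway temporary files in place of the PDFs, and the name close is just a local variable chosen to mirror the question.

```python
import tempfile
from operator import methodcaller

# Three throwaway files stand in for the opened PDFs from the question.
files = [tempfile.TemporaryFile() for _ in range(3)]

# methodcaller("close") behaves like: lambda f: f.close()
close = methodcaller("close")

# map() is lazy in Python 3, so a plain loop is used to actually run the calls.
for f in files:
    close(f)

print(all(f.closed for f in files))
```

With close defined this way, list(map(close, files)) would also work, since wrapping map in list forces every call to run.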
Use contextlib.ExitStack to open your files and ensure that they are properly closed.
from contextlib import ExitStack
names = ["foo1.pdf", "foo2.pdf", "foo3.pdf"]
with ExitStack() as es:
    files = [es.enter_context(open(f, "rb")) for f in names]
    # ... extract portions from the opened files using PyPDF2 and assemble a new PDF file
# proceed with the new PDF file

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
    print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages: # failing here
            print(page.extractText()) # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory; this works fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its own docstring says:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
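If the PDF does contain extractable text, writing one .txt file per page could look roughly like the sketch below. It relies on duck typing: reader is assumed to expose a pages sequence whose items have an extract_text() method, which is how recent PyPDF2 releases name them on PdfReader (the question's older API spells these getPage() and extractText()). The function name write_page_texts and the output file naming are illustrative.

```python
from pathlib import Path


def write_page_texts(reader, out_dir="txt_versions"):
    """Write the text of each page in reader to out_dir/document-page<i>.txt.

    reader is anything with a .pages sequence whose items provide
    extract_text(); this is an assumption about the installed PyPDF2 version.
    """
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for i, page in enumerate(reader.pages):
        # Extraction may return nothing for scanned or image-only pages.
        text = page.extract_text() or ""
        target = folder / f"document-page{i}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

This would need to be called while the input file is still open, i.e. inside the with block that creates the reader.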

How can I create a word (.docx) document if not found using python and write in it?

How can I create a word (.docx) document if not found using python and write in it?
I certainly cannot do either of the following:
file = open(file_name, 'r')
file = open(file_name, 'w')
or, to create or append if found:
f = open(file_name, 'a+')
Also I cannot find any related info in python-docx documentation at:
https://python-docx.readthedocs.io/en/latest/
NOTE:
I need to create an automated report via python with text and pie charts, graphs etc.
Probably the safest way to create a new file for writing is to use 'xb' mode. 'x' requests exclusive creation and raises a FileExistsError if the file is already there, so an existing file is never truncated. 'b' is necessary because a Word document is fundamentally a binary file: it's a zip archive with XML and other files inside it. A zip archive does not survive having its bytes passed through a character encoding.
Document.save accepts streams, so you can pass in a file object opened like that to save your document.
Your work-flow could be something like this:
doc = docx.Document(...)
...
# Make your document
...
with open('outfile.docx', 'xb') as f:
    doc.save(f)
It's a good idea to use with blocks instead of raw open to ensure the file gets closed properly even in case of an error.
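The behaviour of 'x' mode can be checked without python-docx. Here is a small sketch using only the standard library; the temporary directory and the placeholder bytes are illustrative, and the file written here is not a valid Word document.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "outfile.docx")

# The first exclusive open succeeds because the file does not exist yet.
with open(path, "xb") as f:
    f.write(b"placeholder bytes, not a real docx")

# The second exclusive open refuses to touch the existing file.
try:
    with open(path, "xb") as f:
        f.write(b"this would have overwritten the file")
    overwritten = True
except FileExistsError:
    overwritten = False

print(overwritten)
```

Catching FileExistsError is the place to decide what should happen instead, such as loading the existing document and modifying it.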
In the same way that you can't simply write to a Word file directly, you can't append to it either. The way to "append" is to open the file, load the Document object, and then write it back, overwriting the original content. Since the word file is a zip archive, it's very likely that appended text won't even be at the end of the XML file it's in, much less the whole docx file:
doc = docx.Document('file_to_append.docx')
...
# Modify the contents of doc
...
doc.save('file_to_append.docx')
Keep in mind that the python-docx library may not support loading some elements, which may end up being permanently discarded when you save the file this way.
Looks like I found an answer:
The important point here was to create a new file if not found, or otherwise edit the already present file.
import os
from docx import Document
file_name = 'file_name.docx'
# Check whether the file is already present and create it if not
if not os.path.isfile(file_name):
    # Create a blank document
    document = Document()
    # Save the blank document
    document.save(file_name)
#------------editing the file_name.docx now------------------------
# Open the existing document
document = Document(file_name)
# Edit it
document.add_heading("hello world", 0)
# Save the document at the end
document.save(file_name)
Further edits/suggestions are welcome.

Python basics - request data from API and write to a file

I am trying to use "requests" package and retrieve info from Github, like the Requests doc page explains:
import requests
r = requests.get('https://api.github.com/events')
And this:
with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size):
        fd.write(chunk)
I have to say I don't understand the second code block.
filename - in what form do I provide the path to the file if created? where will it be saved if not?
'wb' - what is this variable? (shouldn't second parameter be 'mode'?)
following two lines probably iterate over data retrieved with request and write to the file
Python docs explanation also not helping much.
EDIT: What I am trying to do:
use Requests to connect to an API (Github and later Facebook GraphAPI)
retrieve data into a variable
write this into a file (later, as I get more familiar with Python, into my local MySQL database)
Filename
When using open, a relative path is resolved against the current working directory. So open('file.txt','w') creates a new file named file.txt in the directory you run the script from, which is not necessarily the folder the script lives in. You can also specify an absolute path, for example /home/user/file.txt on Linux. If a file named 'file.txt' already exists, its contents will be completely overwritten.
Mode
The 'wb' option is indeed the mode. The 'w' means write and the 'b' means binary. You use 'w' when you want to write to (rather than read from) a file, and 'b' for binary files (rather than text files). In this loop 'b' is needed even though the content is text, because iter_content yields bytes by default and, in Python 3, writing bytes to a file opened in text mode raises a TypeError. Plain 'w' works if you write the decoded r.text instead. Read more on the modes in the docs for open.
The Loop
This part is using the iter_content method from requests, which is intended for use with large files that you may not want in memory all at once. This is unnecessary in this case, since the page in question is only 89 KB. See the requests library docs for more info.
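The chunked pattern can be tried without a network connection by feeding it any iterable of bytes. This is a minimal sketch; write_chunks is a hypothetical helper name, and the requests call in the trailing comment (including the chunk size of 8192 and the output file name) is an illustrative assumption rather than code from the question.

```python
from typing import Iterable


def write_chunks(chunks: Iterable[bytes], filename: str) -> int:
    """Write an iterable of byte chunks to filename; return the bytes written."""
    total = 0
    # 'wb' because the chunks are bytes, not str.
    with open(filename, "wb") as fd:
        for chunk in chunks:
            total += fd.write(chunk)
    return total


# With requests this could be fed from a streamed response, for example:
# r = requests.get('https://api.github.com/events', stream=True)
# write_chunks(r.iter_content(chunk_size=8192), "events.json")
```

Only one chunk is held in memory at a time, which is the reason the documentation example is written as a loop.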
Conclusion
The example you are looking at is meant to handle the most general case, in which the remote file might be binary and too big to be in memory. However, we can make your code more readable and easy to understand if you are only accessing small webpages containing text:
import requests
r = requests.get('https://api.github.com/events')
with open('events.txt','w') as fd:
    fd.write(r.text)
filename is a string holding the path you want to save to. It accepts either a relative or an absolute path, so you can just have filename = 'example.html'
wb stands for WRITE & BYTES; see the documentation for open to learn more.
The for loop goes over the entire returned content (in chunks, in case it is too large to hold in memory at once) and writes each chunk until there are no more. Useful for large files, but for a single webpage you could just do:
# just 'w' because we are not writing bytes anymore, just text
with open(filename, 'w') as fd:
    fd.write(r.text)
