Python Camelot - export one PDF file to one converted file

Python 3.7 with Camelot 0.7.3.
By default, Camelot exports a separate converted file for each page of the PDF file. I need one PDF file to export to one converted file (we use HTML conversion), regardless of how many pages the PDF has. The documentation does not cover this scenario. Is there a way to achieve this without using compress=True? A zip file will not work in our application.
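One possible approach, sketched under assumptions: skip the built-in export and instead render each extracted table to an HTML fragment, then write all fragments into a single file. The helper names `combine_html_fragments` and `pdf_to_single_html` are hypothetical. The sketch relies on `camelot.read_pdf(..., pages="all")` returning tables that each expose a pandas DataFrame as `.df` and a page number as `.page`; it has not been checked against Camelot 0.7.3 specifically.

```python
import html


def combine_html_fragments(fragments, title="Converted PDF"):
    """Wrap (heading, html_fragment) pairs in one HTML document string."""
    parts = [
        "<!DOCTYPE html>",
        "<html>",
        '<head><meta charset="utf-8"><title>%s</title></head>' % html.escape(title),
        "<body>",
    ]
    for heading, fragment in fragments:
        parts.append("<h2>%s</h2>" % html.escape(heading))
        parts.append(fragment)  # fragment is already HTML, so it is not escaped
    parts.append("</body>")
    parts.append("</html>")
    return "\n".join(parts)


def pdf_to_single_html(pdf_path, out_path):
    """Extract every table Camelot finds in the PDF and write one HTML file."""
    import camelot  # imported lazily so the helper above works without Camelot

    tables = camelot.read_pdf(pdf_path, pages="all")
    fragments = [
        ("Page %s, table %d" % (table.page, index), table.df.to_html())
        for index, table in enumerate(tables, start=1)
    ]
    with open(out_path, "w", encoding="utf-8") as handle:
        handle.write(combine_html_fragments(fragments, title=pdf_path))
    return len(fragments)
```

Calling `pdf_to_single_html("input.pdf", "output.html")` would then produce one HTML file per PDF, with no zip archive involved.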

Related

Generic way to display any file from GridFS

Given a file from GridFS, I'd like to be able to display it on a webpage.
The files in my database can be of any common type, including jpgs, pngs, xml, txt, csv, etc.
A user would like to be able to click on the name of the file and in a new tab the file is displayed whether it is an image or text file, or click download and download the file with its original extension.
The application is in Python. I have seen some solutions on here, but they require reading the bytes into a buffer, concatenating them, and formatting image markup with the bytes as a base64 string. They also require the programmer to know the file's extension and to handle and format each extension case separately.
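A minimal sketch of an extension-agnostic approach, assuming the original file name is stored with each GridFS file: let the standard library's `mimetypes` module guess the content type, and let the browser decide how to render it. `Content-Disposition: inline` asks the browser to display the file in the tab, while `attachment` forces a download under the original name. The function names here are hypothetical, and wiring the result into a web framework response is left out.

```python
import mimetypes


def build_file_headers(filename, download=False):
    """Return HTTP headers for showing a file inline or forcing a download."""
    content_type, _encoding = mimetypes.guess_type(filename)
    if content_type is None:
        # Unknown extension: fall back to a generic binary type.
        content_type = "application/octet-stream"
    disposition = "attachment" if download else "inline"
    safe_name = filename.replace('"', "")  # keep the quoted header value well formed
    return {
        "Content-Type": content_type,
        "Content-Disposition": '%s; filename="%s"' % (disposition, safe_name),
    }


def load_from_gridfs(db, file_id, download=False):
    """Fetch a stored file and pair its bytes with suitable headers."""
    import gridfs  # ships with PyMongo; imported lazily

    grid_out = gridfs.GridFS(db).get(file_id)
    return grid_out.read(), build_file_headers(grid_out.filename, download)
```

The bytes and headers returned by `load_from_gridfs` can be handed to whatever response object the web framework provides, so no per-extension branching is needed.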

How to load in Python an xlsx that originally had .xls file extension?

I'm using xlrd to process .xls files, and openpyxl to process .xlsx files, and this is working well.
Then I'm handed what is ostensibly a .xls file, so I try to xlrd.open_workbook(), and get:
XLRDError: Unsupported format, or corrupt file: Expected BOF record; found '<?xml ve'
I take a look at this question, and I surmise that my file, although ending with extension .xls, must actually be a .xlsx. And indeed, I can view it in a text editor:
<?xml version="1.0" encoding="UTF-8"?>
<Workbook xmlns="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns:ss="urn:schemas-microsoft-com:office:spreadsheet"
xmlns:html="http://www.w3.org/TR/REC-html40">
:
:
:
(for privacy reasons, I can't post the whole file, but it's probably not required for our analysis).
So I surmise that if I just copy (cp) it to a .xlsx, I should be able to open it with openpyxl.load_workbook(), but I get:
BadZipfile: File is not a zip file
If it's actually an xls (unlikely) but can't be opened with xlrd, and if it is actually an xlsx but can't be opened with openpyxl, even after I cp it to a .xlsx, what should I do?
Note: If I open up the .xls in Excel, save it as a .xlsx, and retry with openpyxl, it does load fine, but this manual step is not a luxury I will have in the executing of my program.
One thing is clear: The file you're trying to open has a different format than its extension suggests.
As you already know, Excel file formats include (but are not limited to) xls and xlsx.
The Excel 2003 format (xls) is a binary format. This means that if you open an xls file with a text editor, you'll just see gibberish.
The Excel 2007 format (xlsx) is quite different. An xlsx file is a zip file with a bunch of XML files inside. You can use a zip archiver to extract the contents of the xlsx file. Then, you can edit the XML files using any text editor. However, opening an xlsx file directly with a text editor is like opening a zip file with a text editor: You'll just see gibberish.
The fact that you can open your file with a text editor (and read its contents) shows that it's neither an xls file nor an xlsx file. It is neither a binary file nor a zip file; it's a plain XML file.
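The distinction above can be checked programmatically by looking at the first bytes of the file instead of trusting the extension. This is a rough sketch: the function name is hypothetical, and the signatures only identify the container (other Office formats share the same OLE2 and zip signatures), so the result is a hint rather than proof.

```python
OLE2_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # container used by binary .xls
ZIP_MAGIC = b"PK\x03\x04"  # .xlsx files are zip archives


def sniff_spreadsheet_format(path):
    """Guess 'xls', 'xlsx', 'xml' or 'unknown' from the file's leading bytes."""
    with open(path, "rb") as handle:
        head = handle.read(512)
    if head.startswith(OLE2_MAGIC):
        return "xls"
    if head.startswith(ZIP_MAGIC):
        return "xlsx"
    # Skip an optional UTF-8 byte order mark and leading whitespace.
    text = head.lstrip(b"\xef\xbb\xbf \t\r\n")
    if text.startswith(b"<?xml") or text.startswith(b"<Workbook"):
        return "xml"
    return "unknown"
```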
Moreover, this error message says a lot.
BadZipfile: File is not a zip file
It means that openpyxl is trying to open your file as an xlsx file and therefore as a zip file. But when it tries to extract its contents, it fails, because your file isn't a zip file at all.
But if the file is neither an xlsx file nor an xls file, how can Microsoft Excel read it? I wondered that too. After some research, I believe your file has the XML Spreadsheet 2003 file format. This example looks very similar to the file content you posted. Since Microsoft Excel supports this format, it's no wonder that it can read your file.
Unfortunately, Python libraries such as xlrd and openpyxl only support xls and xlsx file formats, so they won't be able to read your file. I think you'll just have to manually convert it to a supported format.
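If manual conversion is not an option, the file can in principle be read directly as XML with the standard library. The following is a minimal sketch, assuming the file follows the usual XML Spreadsheet 2003 layout (`Workbook` > `Worksheet` > `Table` > `Row` > `Cell` > `Data`, all in the `urn:schemas-microsoft-com:office:spreadsheet` namespace shown in the question). The function name is hypothetical, every value is returned as text, and features such as the `ss:Index` attribute (skipped cells) and merged cells are ignored.

```python
import xml.etree.ElementTree as ET

SS_NS = "urn:schemas-microsoft-com:office:spreadsheet"


def read_xml_spreadsheet(source):
    """Return {sheet_name: [[cell_text, ...], ...]} from a path or file object."""
    ns = {"ss": SS_NS}
    root = ET.parse(source).getroot()
    sheets = {}
    for sheet in root.findall("ss:Worksheet", ns):
        name = sheet.get("{%s}Name" % SS_NS)
        rows = []
        for row in sheet.findall("ss:Table/ss:Row", ns):
            cells = []
            for cell in row.findall("ss:Cell", ns):
                data = cell.find("ss:Data", ns)
                # Empty cells have no Data element; represent them as "".
                cells.append("".join(data.itertext()) if data is not None else "")
            rows.append(cells)
        sheets[name] = rows
    return sheets
```

The returned rows could then be written to a real xlsx file with openpyxl if downstream code expects one.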
I am not on OSX, so this is not tested. You may be able to use the appscript package, despite its lack of support, to open the offending file and then resave it.
from appscript import app, k
# Untested: drive Excel on macOS to open the file and resave it as xlsx.
excel = app('Microsoft Excel')
wb = excel.open('/path/to/file.xls')
# Not sure of the exact name of the k.* file format constant.
wb.save_as('/path/to/fileout.xlsx', file_format=k.XLSX_file_format)
I had a similar problem. It turned out that it needed the absolute file path. E.g., "c:/dir/filename.xlsx" instead of "filename.xlsx". Relative paths worked on osx, but not on Windows.
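A small sketch of that workaround: resolve the name to an absolute path before handing it to the library. The helper name is hypothetical, and the path is resolved against the current working directory, so the script still has to be started from (or change into) the right directory.

```python
import os


def absolute_path(name):
    """Turn a possibly relative file name into an absolute path."""
    return os.path.abspath(name)


# Example use (assumes openpyxl is installed and the file exists):
# workbook = openpyxl.load_workbook(absolute_path("filename.xlsx"))
```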

Python : Convert multiple images as multiple pages in pdf for windows

How can I convert multiple images (JPEG) into a single PDF file with multiple pages on Windows?
Using the Image library, I can convert each image to a single-page PDF and then merge those files into one PDF using pdfminer, but that is a two-step process.
I tried to download ImageMagick but couldn't find a binary for Windows. Is it possible to achieve this using PIL?
I'm not totally sure, but you could create a report with JasperReports and then generate a PDF file from it. I believe Python can also work with JasperReports.
What do you think? Maybe it is too much work.
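As for the PIL part of the question, a sketch assuming a reasonably recent Pillow (the maintained PIL fork): its PDF writer accepts `save_all=True` together with `append_images`, which writes one page per image in a single step. The helper names are hypothetical, and pages are ordered by file name.

```python
import os


def collect_image_paths(folder, extensions=(".jpg", ".jpeg")):
    """Return the JPEG files in a folder, sorted by file name."""
    names = sorted(os.listdir(folder))
    return [os.path.join(folder, n) for n in names if n.lower().endswith(extensions)]


def images_to_pdf(image_paths, pdf_path):
    """Write all images into one multi-page PDF, one image per page."""
    from PIL import Image  # Pillow; imported lazily

    if not image_paths:
        raise ValueError("no images to convert")
    # Convert to RGB because the PDF writer rejects some modes (e.g. RGBA).
    pages = [Image.open(path).convert("RGB") for path in image_paths]
    first, rest = pages[0], pages[1:]
    first.save(pdf_path, "PDF", save_all=True, append_images=rest)
```

For example, `images_to_pdf(collect_image_paths("scans"), "scans.pdf")` would bundle every JPEG in a `scans` folder into one file.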

How to convert .xls file to pdf through python scripting

I have created .xls using xlwt module.
Now I want to convert this newly created xls file to PDF. Can anyone please tell me how I can do this through Python scripting? I am using Python 2.6.
There are at least two possibilities here: either you read the .xls back in as cells and generate a PDF, or maybe you want to print the .xls to .pdf from the program that just created the .xls.
In the first case you might want to consider writing the .pdf directly (you have the data to make the .xls in the program at some point). I have a grid class that can be written out in HTML format, or to .xls (using xlwt), .xlsm (using openpyxl) and to .pdf (using reportlab).
You will need to use the second option if you write formula and that kind of stuff in your .xls and want to have the calculated results in your .pdf. In that case use subprocess to call Microsoft Excel or OpenOffice/LibreOffice Calc with the right commandline parameters for printing/converting the file to PDF.
E.g. for LibreOffice 4.0 the commandline would be:
scalc --convert-to pdf yourfile.xls
which will result in a yourfile.pdf
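Calling that from Python with subprocess might look like the sketch below. The helper names are hypothetical. `--headless` (no GUI) and `--outdir` (where the PDF is written) are additional LibreOffice options not mentioned above, and the executable may need to be `soffice` or a full path, depending on the installation.

```python
import subprocess


def build_convert_command(xls_path, out_dir=".", executable="scalc"):
    """Build the LibreOffice command line that converts a spreadsheet to PDF."""
    return [
        executable,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", out_dir,
        xls_path,
    ]


def convert_to_pdf(xls_path, out_dir=".", executable="scalc"):
    """Run the conversion; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(build_convert_command(xls_path, out_dir, executable))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.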

how to read ppt file using python?

I want to get the content (text only) in a ppt file. How to do it?
(To get the content of a txt file, I just need to open and read it. What do I need to do to get the same information from a ppt file?)
By the way, I know there is win32com on Windows. But I am working on Linux now; is there any possible way?
I found this discussion over on Superuser:
Command line tool in Linux to Extract Text From Word, Excel, Powerpoint?
There are several reasonable answers listed there, including using LibreOffice to do this (and for .doc, .docx, .pptx, etc, etc.), and the Apache Tika Project (which appears to be the 5,000lb gorilla in this solution space).
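For the newer .pptx format specifically, the text can also be pulled out with the standard library alone, since a .pptx file is a zip archive of XML files. The sketch below assumes the usual layout (slides stored as `ppt/slides/slideN.xml`, text held in DrawingML `a:t` elements); the function name is hypothetical. It does not handle the binary .ppt format, which would first need converting (for example with LibreOffice), and it orders slides by file number, which may differ from the presentation order.

```python
import re
import zipfile
import xml.etree.ElementTree as ET

DRAWINGML_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
SLIDE_NAME = re.compile(r"ppt/slides/slide(\d+)\.xml$")


def extract_pptx_text(path):
    """Return one list of text runs per slide, ordered by slide file number."""
    slides = []
    with zipfile.ZipFile(path) as archive:
        numbered = []
        for name in archive.namelist():
            match = SLIDE_NAME.match(name)
            if match:
                numbered.append((int(match.group(1)), name))
        for _number, name in sorted(numbered):
            root = ET.fromstring(archive.read(name))
            # Every visible text run lives in an <a:t> element.
            runs = [node.text for node in root.iter("{%s}t" % DRAWINGML_NS) if node.text]
            slides.append(runs)
    return slides
```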
