Unittesting pdf generated from website

I'm writing a package which is used for generating pdf files, by posting some data to a website and retrieving a generated pdf from the data.
My problem is with the unittests. So far I've tried posting a known dataset to the website, retrieving the pdf and comparing it to a pdf which I know is good. This works fine, however there's a timestamp in the pdf, which means that the next day it doesn't work.
As I see it, I have three options.
One is to get rid of the timestamp in the pdf. This seems to be pretty difficult from my googling. It would probably be something like a pdf to image conversion, and then blanking out the timestamp. And then comparing to a reference file.
Option two would be to create a mock website, which I can then use for generating a mock pdf. This option seems a bit strange to me though, as I would then not test the actual connection to the website, and if I ruin something in the connection, I wouldn't catch the bug.
And three would be to just check that I retrieve some data which appears to be a pdf, and then be done with it. This way I would also get around if the website changes a comma in the generated pdf.
So, I guess my question is two-fold. 1: How difficult would the pdf to image to blanking method be, and 2: From a unittesting perspective, would it be a better approach to make a mock website or just test that I get some pdf-like data.

Option 4: figure out where the timestamp lives in the pdf, and compare the bytes before and after it.
For example, if the timestamp is at offset 11 and is two bytes long:
# Open in binary mode: a pdf is not text
with open('reference.pdf', 'rb') as rf:
    reference_data = rf.read()
with open('pdf_from_website.pdf', 'rb') as wf:
    website_data = wf.read()
# Compare everything except bytes 11 and 12
self.assertEqual(reference_data[:11], website_data[:11])
self.assertEqual(reference_data[13:], website_data[13:])
I'm not familiar with the innards of pdf files, so this might not work. You could use diff to see where the differences are and try it, though.
For your second question: It is best if you can test that the returned pdf is both valid and has the contents it should have.
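A variation on option 4, as a rough sketch: rather than hard-coding byte offsets, blank out the metadata fields that usually carry the timestamp before comparing. This assumes the timestamp sits in the uncompressed /CreationDate or /ModDate entries (and that the trailer /ID may also vary between runs). If the timestamp is drawn on the page inside a compressed stream, this will not catch it. The names normalize_pdf and looks_like_pdf are made up for illustration; the second one is the cheap "is it pdf-like" check from option three.

```python
import re

# Assumption: the timestamp lives in the standard metadata entries,
# e.g. /CreationDate (D:20240131120000+01'00')
_DATE_RE = re.compile(rb"/(CreationDate|ModDate)\s*\([^)]*\)")
# Assumption: the trailer /ID array may also differ between runs
_ID_RE = re.compile(rb"/ID\s*\[[^\]]*\]")


def normalize_pdf(data: bytes) -> bytes:
    """Replace volatile metadata with fixed placeholders."""
    data = _DATE_RE.sub(rb"/\1 (D:00000000000000)", data)
    return _ID_RE.sub(b"/ID [<00> <00>]", data)


def looks_like_pdf(data: bytes) -> bool:
    """Cheap sanity check: PDF header at the start, %%EOF near the end."""
    return data.startswith(b"%PDF-") and b"%%EOF" in data[-1024:]
```

In a test you would then assert that normalize_pdf(reference_data) equals normalize_pdf(website_data), with both files read in binary mode.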

Related

Overwriting files; changing the content but keeping the filename

This'll be a bit long winded but might be best to explain the scenario first...
We have a number of BI visualizations that are generated each month for management reporting. Just over 400 images are taken each month and automatically placed in a directory using WKHTMLTOIMAGE. These images are automatically updated into various PowerPoint presentations and emailed off to the relevant teams. All of this "generally" works fine and has removed much of the tedious manual work.
The problem occurs when one of these visualizations fails to update. At the moment there is no way of checking, other than to open up each visualization and compare it to the image that has just been extracted.
If 399 of the 400 images work, and the 400th doesn't, PowerPoint would still be populated using the previously loaded (400th) image due to the way the "Link to File" function works in PowerPoint.
What I'd like to do is use an example image (check.jpg) to overwrite all of the existing images but still keeping their original file names. That way when the monthly report is run if one of them doesn't work the PowerPoint would have still been updated with this check.jpg image which would stand out as something we would need to rerun manually.
I can't seem to find anything along the lines of what I'm looking for. I can list all of the filenames, move them, overwrite them etc but not sure how I'd do it (or even if it's the right way to do it) with the scenario I'm thinking of. If someone could point me in the right direction, that'd be great. Thank you.

Opening a file for writing doesn't change the filename:
with open("path/to/check.jpg", "rb") as src, open("path/to/image.jpg", "wb") as dest:
    dest.write(src.read())
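To do the same for a whole directory, a sketch along these lines might work. It assumes all report images sit in one folder and share an extension; reset_images, the folder layout and the "*.jpg" pattern are assumptions for illustration, not something from the question.

```python
import shutil
from pathlib import Path


def reset_images(image_dir, placeholder, pattern="*.jpg"):
    """Overwrite every matching file in image_dir with the placeholder image."""
    placeholder = Path(placeholder).resolve()
    replaced = []
    for target in sorted(Path(image_dir).glob(pattern)):
        if target.resolve() == placeholder:
            continue  # never copy the placeholder onto itself
        shutil.copyfile(placeholder, target)  # contents change, name stays
        replaced.append(target.name)
    return replaced
```

Running this just before the monthly export means any image that fails to regenerate still shows check.jpg in the presentation.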

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.

The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.

Following on from @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.

create pdf from python

I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will include watermarks later, for example).
I've worked in raw PostScript before, and provided I can generate the correct headers etc. and get a file at the end of it, I want to avoid complex libs that may not do entirely what I want. Some seem to have suffered bitrot and are no longer supported (pypdf and pypdf2). Especially when I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a little binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this as the production environment is a Windows 2008 server (Dev is Ubuntu 12.04) and making something and converting it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?

borrowed from ask.yahoo
A PDF file starts with "%PDF-1.1" if it is a version 1.1 type of PDF file. You can read PDF files ok when they don't have binary data objects stored in them, and you could even make one using Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
But really you should probably just use a library.
Thanks to @LukasGraf for providing this link http://www.gnupdf.org/Introduction_to_PDF that shows how to create a simple hello world pdf from scratch.
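The offset bookkeeping described above (the "startxref" number and the table it points to) can be sketched in a few lines of Python. This is only an illustration under assumptions: one page, Latin-1 text, the built-in Helvetica font, a US Letter page size, and no binary objects, so the optional binary comment line after the header is left out. build_pdf is a made-up name.

```python
def build_pdf(text="Hello, world!"):
    """Return the bytes of a one-page PDF showing `text` in Helvetica."""
    # Escape characters that are special inside a PDF string literal.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    stream = f"BT /F1 24 Tf 72 720 Td ({escaped}) Tj ET".encode("latin-1")
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n%s\nendstream" % (len(stream), stream),
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where this object starts
        out += b"%d 0 obj\n%s\nendobj\n" % (number, body)
    xref_offset = len(out)  # this is the number written after "startxref"
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is the head of the free list
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the result with open("hello.pdf", "wb") should give a file a PDF viewer can open. Anything beyond this (images, embedded fonts, compression) is where a library starts to pay off.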

As long as you're working in Python 2.7, Reportlab seems to be the best solution out there at the moment. It's quite full-featured, and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general hopefully the learning curve won't be too steep.

I recommend you to use a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library check the docs here.

Processing a PDF for information extraction

I am working on a project where I have a PDF file which describes a health policy. What I need to do is extract the information from this PDF and save it in some form such that I can answer questions related to the policy by extracting info from this PDF.
This PDF is too big, so I want to divide it according to its sections, so that when a query related to some particular area comes in I won't have to go through the entire document.
I tried solving this using some PDF converters which convert PDFs into HTML. But these converters won't convert the PDF to HTML properly, so that headings get a heading tag. Also, even if I convert this properly and get the proper sections out of the document, I don't know how to store this data (I mean, in which form should I store it).
Is there any other solution with which I can achieve this? I am using Python, and I can also use NLTK if needed. Also, the format of the PDFs is not fixed; I mean to say my code should work on any kind of PDF.

PDFMiner is great in that it has location for every bit of text it gets from the PDF. It won't be nicely put in header tags or anything like that, but if you have a consistent PDF structure in your docs you might be able to get something working.

want to add url links to .csv datafeed using python

I've looked through the current related questions but have not managed to find anything similar to my needs.
I'm in the process of creating an affiliate store using zencart. One of the issues is that zencart is not designed for redirects and affiliate stores, but it can be done. I will be changing the store so it acts like a showcase store showing prices.
There is a mod called easy populate which allows me to upload datafeeds. This is all well and good, however my affiliate link will not be in each product. I can do it manually after uploading the data feed by going to each product and then adding it as an image with a redirect link. However, when there are over 500 items it's going to be a long, repetitive and time-consuming job.
I have been told that I can add the links to the data feed before uploading it to zencart and that this should be done using Python. I've been reading about Python for several days now and feel I'm looking for the wrong things. I was wondering if someone could please advise the simplest way for me to get this done.
I hope the question makes sense.
thanks
abs

You could write a Python script using the csv module like this:
import csv
# newline='' lets the csv module control line endings (Python 3)
with open('yourcart.csv', 'w', newline='') as f:
    cartWriter = csv.writer(f)
    cartWriter.writerow(['Product', 'yourinfo', 'yourlink'])
You need to know how the link should be formatted; hopefully it can be composed from the other fields present in the CSV file.
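If the link really can be built from a column that is already in the feed, a sketch like the following could add it to every row. The column names and the URL template are placeholders, since the actual feed layout and affiliate link format aren't given in the question; add_links is a made-up helper.

```python
import csv


def add_links(src, dest, id_column, link_column, url_template):
    """Copy a CSV feed from src to dest, appending a link column to each row."""
    reader = csv.DictReader(src)
    fieldnames = list(reader.fieldnames) + [link_column]
    writer = csv.DictWriter(dest, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Hypothetical rule: the link is the template with the product id filled in
        row[link_column] = url_template.format(id=row[id_column])
        writer.writerow(row)


# Possible usage (file names, column names and URL are assumptions):
# with open("feed.csv", newline="") as src, \
#         open("feed_with_links.csv", "w", newline="") as dest:
#     add_links(src, dest, "id", "link", "https://example.com/p/{id}?aff=me")
```

The resulting file can then be uploaded through easy populate in place of the original feed.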

First, use the csv module as systempuntoout told you. Secondly, you will want to change your headers to:
mimetype='text/csv'
Content-Disposition = 'attachment; filename=name_of_your_file.csv'
The way to do it depends very much on your website implementation. In pure Python you would probably do that with an HttpResponse object. In Django as well, but there are some shortcuts.
You can find a video demonstrating how to create CSV files with Python on showmedo. It's not free, however.
Now, how to provide a link to download the CSV depends on your website. What is the technology behind it: pure Python, Django, Pylons, TurboGears?
If you can't answer that question, you should ask your boss for training on your infrastructure before trying to make changes to it.
