Overwriting files; changing the content but keeping the filename - python

This'll be a bit long winded but might be best to explain the scenario first...
We have a number of BI visualizations that are generated each month for management reporting. Just over 400 images are taken each month and automatically placed in a directory using WKHTMLTOIMAGE. These images are automatically pulled into various PowerPoint presentations and emailed to the relevant teams. All of this "generally" works fine and has removed much of the tedious manual work.
The problem occurs when one of these visualizations fails to update. At the moment there is no way of checking, other than to open up each visualization and compare it to the image that has just been extracted.
If 399 of the 400 images work, and the 400th doesn't, PowerPoint would still be populated using the previously loaded (400th) image due to the way the "Link to File" function works in PowerPoint.
What I'd like to do is use an example image (check.jpg) to overwrite all of the existing images while keeping their original file names. That way, when the monthly report is run, if one of them fails, the PowerPoint would still have been updated with this check.jpg image, which would stand out as something we need to rerun manually.
I can't seem to find anything along the lines of what I'm looking for. I can list all of the filenames, move them, overwrite them etc but not sure how I'd do it (or even if it's the right way to do it) with the scenario I'm thinking of. If someone could point me in the right direction, that'd be great. Thank you.

Opening a file for writing doesn't change the filename:
with open("path/to/check.jpg", "rb") as src, open("path/to/image.jpg", "wb") as dest:
    dest.write(src.read())
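To apply the same idea to the whole directory, something along these lines might work. It is only a sketch: the function name `overwrite_with_placeholder`, the `*.jpg` pattern, and the assumption that all images sit in one folder are mine, not from the question. It uses `shutil.copyfile`, which replaces the bytes of the destination and leaves its name alone.

```python
import shutil
from pathlib import Path


def overwrite_with_placeholder(image_dir, placeholder, pattern="*.jpg"):
    """Replace the content of every matching file with the placeholder image.

    Hypothetical helper: returns the names of the files that were overwritten.
    """
    placeholder = Path(placeholder).resolve()
    replaced = []
    for target in sorted(Path(image_dir).glob(pattern)):
        # Skip the placeholder itself if it lives in the same directory.
        if target.resolve() == placeholder:
            continue
        # copyfile overwrites the destination's bytes; the filename is unchanged.
        shutil.copyfile(placeholder, target)
        replaced.append(target.name)
    return replaced
```

You would run this just before the monthly export, e.g. `overwrite_with_placeholder("path/to/images", "path/to/check.jpg")`, so that any image the export fails to regenerate still shows check.jpg.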

Related

Database searches in separate files

I am looking for a kind of database which can search in separate files eg. pdf, xls, doc that I get from different suppliers. My idea is something like this:
For example, I need to search for a part number and check different data about it. The file containing the part number must then be opened with the part number marked. If there are multiple hits, the database should display a list of the various files containing the searched item number. The list should act as links that open the file with the item number selected when selecting one from the list.
Does this already exist or how do I approach it?
Today, it's all assembled into a single PDF file of more than 1000 pages, and it's a time-consuming and laborious process to maintain.
I've only used vba in connection with Excel, so maybe it's too complicated for me. But is it possible for a programmer without spending 1000 hours on it?
Please help me :-)
Either Access or Excel could do this. I noticed the Python tag. I'm sure Python could handle this as well, although it seems more like a database solution would be best. It sounds like a one-to-many scenario. See the link below for some ideas of how this technique works.
https://www.tutorialspoint.com/ms_access/ms_access_one_to_many_relationship.htm
Also, below is a link with a whole bunch of MS Access templates. Take a look at that and hopefully that will give you some ideas of how to get started.
https://www.microsoftaccessexpert.com/Microsoft-Access-Templates.aspx
I agree, keeping this in a PDF with 1000 pages is NOT the way to go!!

Converting PDF to any parse-able format

I have a PDF file which consists of tables which can spread across various pages and may have text in between. An example of it can be found here.
I am able to convert the PDF to any format but the output files are not in any way parse-able i.e. I cannot extract data out of it as they are scattered. Here are the links to the output files which I created using pdftotext and pdftohtml.
Is there a way to extract data in a more suitable way?
Thanks in advance.
The general answer is no. pdf is a format intended for visual presentation and printing, and there is no guarantee that the contents will be in any particular order let alone structured as a table in any way other than what appears when the pdf is rendered onto paper or a screen. Sometimes there is even deliberate obfuscation to prevent anyone doing what you are attempting.
In this case it appears to be possible to cut and paste the contents of each table element. For a small number of similar files that is almost certainly the quickest thing to do. Open the pdf on the left hand of your screen, a spreadsheet or data-entry program on the right hand, then cut and paste. For a medium number - tens, hundreds? - it's probably cheapest to hire a temp to do the donkey-work. For a large number - thousands? - it would be possible to create a program to automate this process, but definitely not easy. I might think about using human input via the mouse to identify the corners of the table and the horizontal / vertical divisions, then generating cut and paste operations via control of the human interface devices. Don't ask me how. I'd have to find out if I had to do this, and I'd much rather not. It's a WOMBAT.
Whatever form of analysis you did on the pdf contents would certainly not generalize to other pdfs created by different organisations using different software, and possibly not even by the same organisation using the same process but merely a later release of the same software.
Following in the line of @nigel222, it really depends on the PDF how easily you can get the data out in some useful way.
It is best if the PDF is structured (has a document structure, created when the PDF was written). In this case, you can access the structure, and you are all set.
As structure is a fundamental necessity of an accessible PDF, you may try to "massage" the document by applying the various "make accessible" utilities floating around; definitely something to follow.

Unittesting pdf generated from website

I'm writing a package which is used for generating pdf files, by posting some data to a website and retrieving a generated pdf from the data.
My problem is with the unittests. So far I've tried posting a known dataset to the website, retrieving the pdf, and comparing it to a pdf which I know is good. This works fine; however, there's a timestamp in the pdf, which means that the next day it doesn't work.
As I see it, I have three options.
One is to get rid of the timestamp in the pdf. This seems to be pretty difficult from my googling. It would probably be something like a pdf to image conversion, and then blanking out the timestamp. And then comparing to a reference file.
Option two would be to create a mock website, which I can then use for generating a mock pdf. This option seems a bit strange to me though, as I would then not test the actual connection to the website, and if I ruin something in the connection, I wouldn't catch the bug.
And three would be to just check that I retrieve some data which appears to be a pdf, and then be done with it. This way I would also get around if the website changes a comma in the generated pdf.
So, I guess my question is two-fold. 1: How difficult would the pdf to image to blanking method be, and 2: From a unittesting perspective, would it be a better approach to make a mock website or just test that I get some pdf-like data.
option 4: figure out where the time stamp lives in the pdf, and compare the bytes before and after
For example, if the time stamp is at offset 11 and is two bytes long:
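# Open in binary mode: a pdf is not text, so compare raw bytes.
with open('reference.pdf', 'rb') as rf:
    reference_data = rf.read()
with open('pdf_from_website.pdf', 'rb') as wf:
    website_data = wf.read()
# Compare everything before and after the two timestamp bytes (offsets 11-12).
self.assertEqual(reference_data[:11], website_data[:11])
self.assertEqual(reference_data[13:], website_data[13:])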
I'm not familiar with the innards of pdf files so this might not work. You could use diff to see where the differences are and try, though.
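The comparison could also be wrapped in a small helper so the offset and length live in one place. This is a rough sketch, not a tested recipe for real pdfs: the name `equal_outside_span` is made up, and it assumes the timestamp sits at a fixed offset with a fixed length and is not inside a compressed stream.

```python
def equal_outside_span(reference, candidate, start, length):
    """Return True if two byte strings match everywhere except [start, start + length)."""
    end = start + length
    return (
        len(reference) == len(candidate)
        and reference[:start] == candidate[:start]
        and reference[end:] == candidate[end:]
    )
```

In a test method you could then write `self.assertTrue(equal_outside_span(reference_data, website_data, 11, 2))`.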
For your second question: It is best if you can test that the returned pdf is both valid and has the contents it should have.

Need suggestions for designing a content organizer

I am trying to write a python program which could take content and categorize it based on tags. I am using Nepomuk to tag files and PyQt for the GUI. The problem is, I am unable to decide how to save the content. Right now, I am saving each entry individually to a text file in a folder. When I need to read the contents, I tell the program to get all the files in that folder and then perform a read operation on each file. Since the number of files is small for now (fewer than 20), this approach is decent enough. But I am worried that when the file count increases, this method will become inefficient. Is there any other method to save content efficiently?
Thanks in advance.
You could use sqlite3 module from stdlib. Data will be stored in a single file. The code might be even simpler than the one used for reading all adhoc text files by hand.
You could always export the data in a format suitable for sharing in your case.
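As a rough illustration of what that might look like, here is a minimal sketch. The table name `entries`, its columns, and the helper names are all made up for the example, and it stores one tag per entry to keep things short; your real schema would depend on how Nepomuk tags map to your content.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the single-file store; ':memory:' is handy for trying it out."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "id INTEGER PRIMARY KEY, tag TEXT NOT NULL, content TEXT NOT NULL)"
    )
    return conn


def add_entry(conn, tag, content):
    # Using the connection as a context manager commits on success
    # and rolls back if the insert raises.
    with conn:
        conn.execute(
            "INSERT INTO entries (tag, content) VALUES (?, ?)", (tag, content)
        )


def entries_for_tag(conn, tag):
    """Return the content of every entry with the given tag, oldest first."""
    rows = conn.execute(
        "SELECT content FROM entries WHERE tag = ? ORDER BY id", (tag,)
    )
    return [row[0] for row in rows]
```

Pass a real filename to `open_store` to keep the data between runs; looking up entries by tag then becomes a single query instead of reading every file in the folder.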

Insert png image programatically in pdf file at specific location

Looking for easiest way to do the following:
I have created 10,000 unique QR-codes, with unique filenames.
I have one postcard design (.ai, eps, pdf - doesn't matter) with a placeholder for the QR code and for the unique filename (sans .png extension).
How would I go about inserting each of the 10,000 PNGs into 10,000 copies of the pdf file? (I also need to do the same with the unique filename / text string that represents each QR code.)
Since I am really no good with programming, it doesn't matter which tools are used, as long as you hold my hand or there is a link to beginner documentation.
however:
I am trying to learn python - so that is preferred.
I work a little bit with R - but that will not be the easiest solution.
If this can be done directly from the terminal with a shell script then hallelujah :-)
But really - if you know of a solution - then please post it, regardless of the tools.
Thanks in advance.
You can do it in Python using pyPdf to merge documents.
Basically, you create a PDF with your QRCode placed where you want it in the end.
You can use the (c)StringIO module to store the created PDF file in memory.
You can find pyPDF here; there's an example that shows how you would add a watermark to a file, you should be following the same logic.
