I'm trying to verify that the content generated by wkhtmltopdf is the same from run to run. However, every time I run wkhtmltopdf I get a different hash/checksum value for the same page. We're talking about something really basic, like an HTML page of:
<html>
<body>
<p> This is some text</p>
</body>
</html>
I get a different MD5 or SHA-256 hash every time I run wkhtmltopdf with a command as simple as:
./wkhtmltopdf example.html ~/Documents/a.pdf
And using a python hasher of:
import hashlib

def shasum(filename):
    sha = hashlib.sha256()
    with open(filename, 'rb') as f:
        # Hash the file in chunks so large files don't have to fit in memory
        for chunk in iter(lambda: f.read(128 * sha.block_size), b''):
            sha.update(chunk)
    return sha.hexdigest()
or the MD5 version, which just swaps sha256 for md5.
Why would wkhtmltopdf generate output that differs enough to change the checksum, and is there any way to prevent that? Is there some command-line option that can be passed in?
I've tried --default-header, --no-pdf-compression and --disable-smart-shrinking
This is on Mac OS X, but I've generated these PDFs on other machines and downloaded them, with the same result.
wkhtmltopdf version = 0.10.0 rc2
I tried this and opened the resulting PDF in Emacs. wkhtmltopdf is embedding a "/CreationDate" field in the PDF. It is different for every run, and it will screw up the hash values between runs.
I didn't see an option to disable the "/CreationDate" field, but it would be simple to strip it out of the file before computing the hash.
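A minimal Python sketch of that stripping step, assuming the field always appears as a /CreationDate (D:...) literal, might look like:

import hashlib
import re

def pdf_digest_ignoring_creation_date(filename):
    # Drop the volatile /CreationDate entry before hashing the rest
    with open(filename, 'rb') as f:
        data = f.read()
    data = re.sub(rb'/CreationDate \(D:[^)]*\)', b'', data, count=1)
    return hashlib.sha256(data).hexdigest()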
I wrote a method to copy the creation date from the expected output into the freshly generated file. It's in Ruby, and the arguments can be anything that walks and quacks like an IO:
def copy_wkhtmltopdf_creation_date(to, from)
  to_current_pos, from_current_pos = [to.pos, from.pos]
  to.pos = from.pos = 74   # offset of the /CreationDate value in my documents
  to.write(from.read(14))  # the date is a 14-digit timestamp
  to.pos, from.pos = [to_current_pos, from_current_pos]
end
I was inspired by Carlos to write a solution that doesn't use a hardcoded index, since in my documents the index differed from Carlos's 74. Also, I don't have the files open already, and I handle the case where no CreationDate is found by returning early.
def copy_wkhtmltopdf_creation_date(to, from)
  index, date = File.foreach(from).reduce(0) do |acc, line|
    if line.index("CreationDate")
      break [acc + line.index(/\d{14}/), $~[0]]
    else
      acc + line.bytesize
    end
  end

  if date # i.e., yes, this is a wkhtmltopdf document
    File.open(to, "r+") do |to|
      to.pos = index
      to.write(date)
    end
  end
end
We solved the problem by stripping the creation date with a simple regex.
preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);
After doing this we can get a consistent checksum every time.
I would like to generate invoices with a little Python program I've written. I want it to output the invoice first as an HTML file, then convert it to a PDF. Currently my code is creating the HTML doc as desired, but it is outputting a blank PDF file:
import pdfkit
import time

name = input('Enter business name: ')
date = input('Enter the date: ')
invoice = input('Enter invoice number: ')
total = 1000  # Fixed number for now, just for testing purposes

f = open('invoice.html', 'w')
message = f"""<html>
<head>
<link rel='stylesheet' href='style.css'>
</head>
<body>
<h1>INVOICE<span id="invoice-number"> {invoice}</span></h1>
<h3>{date} </h3>
<h3>{name} </h3>
<h3>Total: ${total}</h3>
</body>
</html>"""
f.write(message)
time.sleep(2)

options = {
    'enable-local-file-access': None
}
pdfkit.from_file('invoice.html', 'out.pdf', options=options)
When I run the code above, the HTML doc it puts out looks fine, but the PDF it creates is blank. I receive no error messages at all. The output in Terminal is:
Loading pages (1/6)
Counting pages (2/6)
Resolving links (4/6)
Loading headers and footers (5/6)
Printing pages (6/6)
Done
However, I have another file called test.py in the same directory. This file contains the same import statement, exactly the same last four lines as the first script, and nothing else:
import pdfkit

options = {
    'enable-local-file-access': None
}
pdfkit.from_file('invoice.html', 'out.pdf', options=options)
If I run this file after the previous file, it correctly outputs the PDF to look just like the HTML version. Why does this code work when run in a separate file, but doesn't work when included in the original file? How can I get it to run in the same file so I don't have to execute two commands to get an invoice?
You never close the file you're writing to, or explicitly flush it. Since it's a small text snippet and buffering is enabled by default, it doesn't get written to disk until after your process terminates. You don't see an error because the file does get created immediately when it's opened, though; the issue is not one of timing, as you seem to suppose.
There are any number of solutions to this problem. The simplest and most correct is to close the file as soon as you're finished writing to it. This is how you open a file idiomatically in Python:
name = input('Enter business name: ')
date = input('Enter the date: ')
total = 1000  # Fixed number for now, just for testing purposes
message = ...

options = {
    'enable-local-file-access': None
}

with open('invoice.html', 'w') as f:
    f.write(message)

pdfkit.from_file('invoice.html', 'out.pdf', options=options)
Notice that you don't need to wait for anything: exiting the with block closes the file, flushing it in the process, even if an error occurs. Putting the call to pdfkit.from_file after the with block therefore guarantees that it has a fully flushed file to work with. You should also get in the habit of closing files as soon as you finish working with them: most OSes support only a finite number of file handles per process.
Here are some other methods, which are not idiomatic, and are not recommended in practice. I'm providing them just to give you a sense of how the different steps work:
Flush the file before attempting to read it. Instead of sleep, call
f.flush()
This will leave the file object open and won't necessarily handle errors correctly, but it will ensure that the HTML content is written out before you attempt to read it.
Turn off buffering. With buffering on, nothing is written to disk until you accumulate a disk block's worth of data (around 4 KB) or flush the file. You can disable this behavior so bytes are written out immediately, though note that Python 3 only allows unbuffered I/O in binary mode:
open('invoice.html', 'wb', buffering=0)
I am trying to solve a problem:
I receive an auto-generated email from the government with no useful tags in the HTML. It's one table nested inside another, an abomination of a template. I get it every few days, and I want to extract some fields from it. My idea was this:
Use the HTML in the email as a template. Remove all fields that change with every mail, like the name of my client, their unique ID, and the issue explained in the mail.
Diff this template, with its fields removed, against new emails. That should give me all the new info in one shot without having to parse the email.
The problem is, I can't find any way of extracting only these additions. I tried difflib in Python, but it returns byte streams of additions and deletions on each line that I can't process properly. I want a way to return only the additions and nothing else. I'm open to other libraries or methods; I just don't want to write a huge regex with tons of HTML in it.
When I got the stdout from calling diff via Popen, it also returned bytes. You can convert the bytes to characters and then continue with your processing; the code below does something similar to convert the bytes to a string. It calls diff on two files and prints only the lines beginning with '>' (i.e., lines new in the right-hand file):
#!/usr/bin/env python
import os
import subprocess
import sys

file1 = 'test1'
file2 = 'test2'
if len(sys.argv) == 3:
    file1 = sys.argv[1]
    file2 = sys.argv[2]

if not os.access(file1, os.R_OK):
    print(f'Unable to read: \'{file1}\'')
    sys.exit(1)
if not os.access(file2, os.R_OK):
    print(f'Unable to read: \'{file2}\'')
    sys.exit(1)

argv = ['diff', file1, file2]
runproc = subprocess.Popen(args=argv, stdout=subprocess.PIPE)
out, err = runproc.communicate()

# Convert the bytes to a string one character at a time
# (out.decode('utf-8') is the idiomatic one-liner)
outstr = ''
for c in out:
    outstr += chr(c)

for line in outstr.split('\n'):
    if len(line) == 0:
        continue
    if line[0] == '>':
        print(line)
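As an aside, a similar additions-only filter can be sketched with the standard difflib module, with no subprocess involved, assuming both inputs are read as text (the file names here are placeholders):

import difflib

def additions_only(old_path, new_path):
    # Read both files as text so difflib works on str rather than bytes
    with open(old_path) as f:
        old_lines = f.readlines()
    with open(new_path) as f:
        new_lines = f.readlines()
    # Keep only lines added in the new file; skip the '+++' header line
    return [line[1:] for line in difflib.unified_diff(old_lines, new_lines)
            if line.startswith('+') and not line.startswith('+++')]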
I have taken the code from another thread here that uses the library PyPDF2 to parse and replace the text of a PDF. The given example PDF in the thread is parsed as a PyPDF2.generic.DecodedStreamObject. I am currently working with a PDF that the company has provided me that was created using Microsoft Word's Export to PDF feature. This generates a PyPDF2.generic.EncodedStreamObject. From exploration, the main difference is that there is what appears to be kerning in some places in the text.
This caused two problems for me with the sample code. Firstly, the line if len(contents) > 0: in main seems to get erroneously triggered and attempts to use the key of the EncodedStreamObject dictionary instead of the EncodedStreamObject itself. To work around this, I commented out the if block and used the code in the else block for both cases.
The second problem was that the (what I assume are) kerning markings broke up the text I was trying to replace. I noticed that kerning was not in every line, so I made the assumption that the kerning markers were not strictly necessary, and tried to see what the output would look like with them removed. The text was structured something like so: [(Thi)4(s)-1(is t)2(ext)]. I replaced the line in the sample code replaced_line = line in replace_text with replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ". This preserved the observed structure while allowing the text to be searched for replacements. I verified this was actually replacing the text of the line.
Neither of those changes prevented the code from executing; however, the output PDF seems completely unchanged, even though print statements confirm that the replaced lines contain the new text. I initially assumed this was because of the if block in process_data that determines whether the stream is Encoded or Decoded. However, I dug through the actual source code for this library, located here, and it seems that if the object is Encoded, it generates a Decoded version of itself, which that if block reflects. My only other idea is that the if block I commented out in main wasn't erroneously catching my scenario, but was instead handling it incorrectly. I have no idea how I would fix it so that it handles it properly.
I feel like I'm incredibly close to solving this, but I'm at my wits' end as to what to do from here. I would ask the poster of the linked solution in a comment, but I don't have enough reputation to comment on SO. Does anyone have any leads on how to solve this problem? I don't particularly care what library or file format is used, but it must retain the formatting of the Word document I've been provided. I have already tried exporting to HTML, but that removes most of the formatting as well as the header. I have also tried converting the .docx to PDF in Python, but that requires actually having Word installed on the machine, which is not a cross-platform solution. I also explored using RTF, but from what I found, the solution for that file type is to convert it to a .docx and then to PDF.
Here is the full code that I have so far:
import PyPDF2
import re

def replace_text(content, replacements=dict()):
    lines = content.splitlines()
    result = ""
    in_text = False
    for line in lines:
        if line == "BT":
            in_text = True
        elif line == "ET":
            in_text = False
        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                # Strip the kerning markers, keeping only the text runs
                replaced_line = "[(" + "".join(re.findall(r'\((.*?)\)', line)) + ")] TJ"
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue
        result += line + "\n"
    return result

def process_data(obj, replacements):
    data = obj.getData()
    decoded_data = data.decode('utf-8')
    replaced_data = replace_text(decoded_data, replacements)
    encoded_data = replaced_data.encode('utf-8')
    if obj.decodedSelf is not None:
        obj.decodedSelf.setData(encoded_data)
    else:
        obj.setData(encoded_data)

pdf = PyPDF2.PdfFileReader("template.pdf")
# pdf = PyPDF2.PdfFileReader("sample.pdf")
writer = PyPDF2.PdfFileWriter()
replacements = {
    "some text": "replacement text"
}

for page in pdf.pages:
    contents = page.getContents()
    # if len(contents) > 0:
    #     for obj in contents:
    #         streamObj = obj.getObject()
    #         process_data(streamObj, replacements)
    # else:
    process_data(contents, replacements)
    writer.addPage(page)

with open("output.pdf", 'wb') as out_file:
    writer.write(out_file)
EDIT:
I've somewhat tracked down the source of my problems. The line obj.decodedSelf.setData(encoded_data) seems to not actually set the data properly. After that line, I added
print(encoded_data[:2000])
print("----------------------")
print(obj.getData()[:2000])
The first print statement was different from the second, which definitely should not be the case. To test whether this was really true, I replaced every single line with [()], which I know to be valid since many lines are already exactly that. For the life of me, though, I can't figure out why this function call fails to make any lasting change.
EDIT 2:
I have further identified the problem. In the source code for an EncodedStreamObject, the getData method returns self.decodedSelf.getData() if self.decodedSelf is truthy. HOWEVER, after doing obj.decodedSelf.setData(encoded_data), if I do print(bool(obj.decodedSelf)), it prints False. This means that when the EncodedStreamObject is accessed to be written out to the PDF, it re-parses the old PDF and overwrites the self.decodedSelf object! Short of going in and fixing the library's source code, I'm not sure how to solve this.
EDIT 3:
I have managed to convince the library to use the decoded version that has the replacements! By inserting the line page[PyPDF2.pdf.NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), it forces the page to have the updated contents. Unfortunately, my previous assumption about the text kerning was incorrect. After I replaced things, some of my text mysteriously disappeared from the PDF. I assume this is because the format is incorrect somehow.
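For clarity, here is the page loop from the script above with that line added:

for page in pdf.pages:
    contents = page.getContents()
    process_data(contents, replacements)
    # Force the page to reference the decoded (modified) stream instead of
    # letting the writer re-parse the original encoded one
    page[PyPDF2.pdf.NameObject("/Contents")] = contents.decodedSelf
    writer.addPage(page)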
FINAL EDIT:
I figured I'd put this here in case anyone else stumbles across this. I never did manage to get it working as expected; I instead moved to a solution that mimics the PDF with HTML/CSS. If you add the following style tag to your HTML, you can get it to print more like how you'd expect a PDF to print:
<style type="text/css" media="print">
  @page {
    size: auto;
    margin: 0;
  }
</style>
I'd recommend this solution for anyone looking to do what I was doing. There are Python libraries to convert HTML to PDF, but they don't support HTML5 and CSS3 (notably, CSS flex and grid). You can instead just print the HTML page to PDF from any browser to accomplish the same thing. It definitely doesn't answer the question, so I felt it best to leave it as an edit. If anyone manages to complete what I attempted, please post an answer for me and any others.
I am using Python to generate a C++ header file. It is security classified, so I can't post it here.
I generate it based on certain inputs and, if those don't change, the same file should be generated.
Because it is a header file which is #included almost everywhere, touching it causes a full build. So, if nothing has changed, I don't want to touch the existing file.
The simplest approach seemed to be to generate the file in /tmp then take an MD5 hash of the existing file, to see if it needs to be updated.
existingFileMd5 = hashlib.md5(open(headerFilePath, 'rb').read())
newFileMd5 = hashlib.md5(open(tempFilePath, 'rb').read())

if newFileMd5 == existingFileMd5:
    print('Info: file "' + headerFilePath + '" unchanged, so not updated')
    os.remove(tempFilePath)
else:
    shutil.move(tempFilePath, headerFilePath)
    print('Info: file "' + headerFilePath + '" updated')
However, when I run the script twice in quick succession (without changing the inputs), it always decides the MD5 hashes are different and updates the file, needlessly triggering a full build.
There are no variable parts to the file other than those governed by the input. E.g., I am not writing a timestamp.
I have had colleagues eyeball the two files and declare them to be identical (they are quite small). They are also declared to be identical by Linux's meld file compare utility.
So, the problem would seem to be with the code posted above. What am I doing wrong?
You forgot to actually ask for the hashes. You're comparing two md5-hasher-thingies, not the hashes.
Call digest to get the hash as a bytes object, or hexdigest to get a string with a hex encoding of the hash:
if newFileMd5.digest() == existingFileMd5.digest():
    ...
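Applied to the snippet in the question, the comparison could read:

import hashlib

with open(headerFilePath, 'rb') as f:
    existingFileMd5 = hashlib.md5(f.read())
with open(tempFilePath, 'rb') as f:
    newFileMd5 = hashlib.md5(f.read())

# Compare the digests (bytes), not the hasher objects
if newFileMd5.digest() == existingFileMd5.digest():
    print('Info: file "' + headerFilePath + '" unchanged, so not updated')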
A way of checking whether a file has been modified is to calculate and store a hash (or checksum) for the file. Then at any point the hash can be recalculated and compared against the stored value.
I'm wondering: is there a way to store the hash of a file in the file itself? I'm thinking of text files.
The algorithm to calculate the hash would have to account for the hash itself being appended to the very file it is calculated over... makes sense? Is anything like this available?
Thanks!
edit:
https://security.stackexchange.com/questions/3851/can-a-file-contain-its-md5sum-inside-it
from Crypto.Hash import HMAC

secret_key = b"Don't tell anyone"
h = HMAC.new(secret_key)

text = b"whatever you want in the file"
## or: text = open("your_file_without_hash_yet", "rb").read()
h.update(text)

with open("file_with_hash", "wb") as fh:
    fh.write(text)
    fh.write(h.hexdigest().encode())
Now, as some people tried to point out, though they seemed confused - you need to remember that this file has the hash on the end of it and that the hash is itself not part of what gets hashed. So when you want to check the file, you would do something along the lines of:
end_len = len(h.hexdigest())
all_text = open("file_with_hash", "rb").read()
# The hash is the *last* end_len bytes; everything before it is the text
text, expected_hmac = all_text[:-end_len], all_text[-end_len:]

h = HMAC.new(secret_key)
h.update(text)
if h.hexdigest().encode() != expected_hmac:
    raise ValueError("Somebody messed with your file!")
It should be clear though that this alone doesn't ensure your file hasn't been changed; the typical use case is to encrypt your file, but take the hash of the plaintext. That way, if someone changes the hash (at the end of the file) or tries changing any of the characters in the message (the encrypted portion), things will mismatch and you will know something was changed.
A malicious actor won't be able to change the file AND fix the hash to match, because they would need to change some data and then rehash everything with your secret key. So long as no one knows your secret key, they won't know how to recreate the correct hash.
This is an interesting question. You can do it if you adopt a proper convention for hashing and verifying the integrity of the files. Suppose you have this file, namely, main.py:
#!/usr/bin/env python
# encoding: utf-8
print "hello world"
Now, you could append an SHA-1 hash to the python file as a comment:
(printf '#'; cat main.py | sha1sum) >> main.py
Updated main.py:
#!/usr/bin/env python
# encoding: utf-8
print "hello world"
#30e3b19d4815ff5b5eca3a754d438dceab9e8814 -
Hence, to verify if the file was modified you can do this in Bash:
if [ "$(printf '#'; head -n-1 main.py | sha1sum)" == "$(tail -n1 main.py)" ]
then
    echo "Unmodified"
else
    echo "Modified"
fi
Of course, someone could try to fool you by changing the hash string manually. To stop these bad guys, you can improve the system by mixing a secret string into the hashed content before adding the hash to the last line.
Improved version
Add the hash in the last line including your secret string:
(printf '#'; (cat main.py; echo 'MyUltraSecretTemperString12345') | sha1sum) >> main.py
For checking if the file was modified:
if [ "$(printf '#'; (head -n-1 main.py; echo 'MyUltraSecretTemperString12345') | sha1sum)" == "$(tail -n1 main.py)" ]
then
    echo "Unmodified"
else
    echo "Modified"
fi
Using this improved version, the bad guys can only fool you if they find your ultra-secret string first.
EDIT: This is a rough implementation of the keyed-hash message authentication code (HMAC).
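For reference, a minimal sketch of the same idea using Python's standard hmac module (the key and digest choice here are just examples):

import hashlib
import hmac

secret_key = b'MyUltraSecretTemperString12345'

def sign(data: bytes) -> str:
    # Keyed hash of the file body; store this on the last line
    return hmac.new(secret_key, data, hashlib.sha1).hexdigest()

def verify(data: bytes, expected_hex: str) -> bool:
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(sign(data), expected_hex)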
Well, although it looks like a strange idea, this could be an application of a little-used but very powerful feature of the Windows NTFS file system: file streams. It lets you add multiple streams to a file without changing the content of the default stream. For example:
echo foo > foo.text
echo bar > foo.text:alt
type foo.text
=> foo
more < foo.text:alt
=> bar
But when listing the directory, you only see one single file: foo.text
So in your use case, you could write the hash of the main stream into a stream named hash, and later compare the content of the hash stream with a freshly computed hash of the main stream.
Just a remark: for a reason I do not know, type foo.text:alt generates the following error:
"The filename, directory name, or volume label syntax is incorrect."
That's why my example uses more <, as recommended on the Using Streams page on MSDN.
So assuming you have a myhash function that gives the hash for a file (you can easily build one by using the hashlib module):
import hashlib

def myhash(filename):
    # Compute the hash of the file (one possible hashlib-based implementation)
    with open(filename, 'rb') as f:
        return hashlib.sha256(f.read()).hexdigest()
You can do:
def store_hash(filename):
    hash_string = myhash(filename)
    # Write the hash into an alternate stream named "hash"
    with open(filename + ":hash", "w") as fd:
        fd.write(hash_string)

def compare_hash(filename):
    hash_string = myhash(filename)
    with open(filename + ":hash") as fd:
        orig = fd.read()
    return hash_string == orig
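Example usage, assuming foo.text lives on an NTFS volume (alternate streams don't exist on other file systems):

store_hash("foo.text")
print(compare_hash("foo.text"))  # True until the main stream changes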