One way of checking whether a file has been modified is to calculate and store a hash (or checksum) for it. Then at any later point the hash can be re-calculated and compared against the stored value.
I'm wondering if there is a way to store the hash of a file in the file itself? I'm thinking of text files.
The algorithm calculating the hash would have to be iterative and account for the fact that the hash will be appended to the very file the hash is being calculated for... does that make sense? Is anything like that available?
Thanks!
edit:
https://security.stackexchange.com/questions/3851/can-a-file-contain-its-md5sum-inside-it
from Crypto.Hash import HMAC

secret_key = "Don't tell anyone"
h = HMAC.new(secret_key)
text = "whatever you want in the file"
## or: text = open("your_file_without_hash_yet").read()
h.update(text)
with open("file_with_hash", "w") as fh:
    fh.write(text)
    fh.write(h.hexdigest())
Now, as some people tried to point out (though they seemed confused), you need to remember that this file has the hash at the end of it and that the hash itself is not part of what gets hashed. So when you want to check the file, you would do something along the lines of:
end_len = len(h.hexdigest())
all_text = open("file_with_hash").read()
text, expected_hmac = all_text[:-end_len], all_text[-end_len:]
h = HMAC.new(secret_key)
h.update(text)
if h.hexdigest() != expected_hmac:
    raise ValueError("Somebody messed with your file!")
It should be clear though that this alone doesn't ensure your file hasn't been changed; the typical use case is to encrypt your file, but take the hash of the plaintext. That way, if someone changes the hash (at the end of the file) or tries changing any of the characters in the message (the encrypted portion), things will mismatch and you will know something was changed.
A malicious actor won't be able to change the file AND fix the hash to match, because they would need to change some data and then rehash everything with your secret key. So long as no one knows your secret key, they won't know how to recreate the correct hash.
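For reference, the same append-and-verify scheme can be sketched with only Python's standard-library hmac and hashlib modules (the key, message, and SHA-256 choice here are placeholders, not part of the original answer):

```python
import hashlib
import hmac

SECRET_KEY = b"Don't tell anyone"  # placeholder secret

def append_hmac(text):
    # Append the hex HMAC-SHA256 of the text to the text itself.
    mac = hmac.new(SECRET_KEY, text, hashlib.sha256).hexdigest().encode()
    return text + mac

def verify_hmac(data):
    # Split off the trailing hex digest, recompute, and compare.
    digest_len = hashlib.sha256().digest_size * 2  # hex digits per digest
    text, expected = data[:-digest_len], data[-digest_len:]
    mac = hmac.new(SECRET_KEY, text, hashlib.sha256).hexdigest().encode()
    if not hmac.compare_digest(mac, expected):
        raise ValueError("Somebody messed with your file!")
    return text

protected = append_hmac(b"whatever you want in the file")
assert verify_hmac(protected) == b"whatever you want in the file"
```

hmac.compare_digest is used instead of != so the comparison takes constant time.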
This is an interesting question. You can do it if you adopt a proper convention for hashing and verifying the integrity of the files. Suppose you have this file, namely, main.py:
#!/usr/bin/env python
# encoding: utf-8
print "hello world"
Now, you could append an SHA-1 hash to the python file as a comment:
(printf '#'; cat main.py | sha1sum) >> main.py
Updated main.py:
#!/usr/bin/env python
# encoding: utf-8
print "hello world"
#30e3b19d4815ff5b5eca3a754d438dceab9e8814 -
Hence, to verify if the file was modified you can do this in Bash:
if [ "$(printf '#';head -n-1 main.py | sha1sum)" == "$(tail -n1 main.py)" ]
then
echo "Unmodified"
else
echo "Modified"
fi
Of course, someone could try to fool you by changing the hash string manually. To stop these bad guys, you can improve the scheme by mixing a secret string into the hashed content before adding the hash to the last line.
Improved version
Add the hash in the last line including your secret string:
(printf '#'; (cat main.py; echo 'MyUltraSecretTemperString12345') | sha1sum) >> main.py
For checking if the file was modified:
if [ "$(printf '#';(head -n-1 main.py; echo 'MyUltraSecretTemperString12345') | sha1sum)" == "$(tail -n1 main.py)" ]
then
echo "Unmodified"
else
echo "Modified"
fi
Using this improved version, the bad guys can only fool you if they find your ultra-secret string first.
EDIT: This is a rough implementation of the keyed-hash message authentication code (HMAC).
Well, although it looks like a strange idea, it could be an application of a little-used but very powerful feature of the Windows NTFS file system: file streams.
It allows you to add many streams to a file without changing the content of the default stream. For example:
echo foo > foo.text
echo bar > foo.text:alt
type foo.text
=> foo
more < foo.text:alt
=> bar
But when listing the directory, you only see one single file: foo.text
So in your use case, you could write the hash of the main stream into a stream named hash, and later compare the content of the hash stream with the hash of the main stream.
Just a remark: for a reason I do not know, type foo.text:alt generates the following error:
"The filename, directory name, or volume label syntax is incorrect."
That's why my example uses more <, as recommended on the Using streams page on MSDN.
So assuming you have a myhash function that gives the hash for a file (you can easily build one by using the hashlib module):
def myhash(filename):
    # compute the hash of the file
    ...
    return hash_string
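A minimal myhash along these lines, using hashlib (SHA-256 is an arbitrary choice here; any algorithm the module offers works):

```python
import hashlib

def myhash(filename):
    # Hash the file in fixed-size chunks so large files
    # don't have to fit in memory at once.
    h = hashlib.sha256()
    with open(filename, "rb") as fd:
        for chunk in iter(lambda: fd.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```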
You can do:
def store_hash(filename):
    hash_string = myhash(filename)
    with open(filename + ":hash", "w") as fd:
        fd.write(hash_string)

def compare_hash(filename):
    hash_string = myhash(filename)
    with open(filename + ":hash") as fd:
        orig = fd.read()
    return (hash_string == orig)
I am using Python to generate a C++ header file. It is security classified, so I can't post it here.
I generate it based on certain inputs and, if those don't change, the same file should be generated.
Because it is a header file which is #included almost everywhere, touching it causes a full build. So, if there is no change, I do not want to generate the file.
The simplest approach seemed to be to generate the file in /tmp then take an MD5 hash of the existing file, to see if it needs to be updated.
existingFileMd5 = hashlib.md5(open(headerFilePath, 'rb').read())
newFileMd5 = hashlib.md5(open(tempFilePath, 'rb').read())
if newFileMd5 == existingFileMd5:
    print('Info: file "' + headerFilePath + '" unchanged, so not updated')
    os.remove(tempFilePath)
else:
    shutil.move(tempFilePath, headerFilePath)
    print('Info: file "' + headerFilePath + '" updated')
However, when I run the script twice in quick succession (without changing the inputs), it always thinks that the MD5 hashes are different and updates the file, thus needlessly triggering a full build.
There are no variable parts to the file, other than those governed by the input. E.g, I am not writing a timestamp.
I have had colleagues eyeball the two files and declare them to be identical (they are quite small). They are also declared to be identical by Linux's meld file compare utility.
So, the problem would seem to be with the code posted above. What am I doing wrong?
You forgot to actually ask for the hashes. You're comparing two md5-hasher-thingies, not the hashes.
Call digest to get the hash as a bytes object, or hexdigest to get a string with a hex encoding of the hash:
if newFileMd5.digest() == existingFileMd5.digest():
...
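Put together, a corrected comparison might look like this sketch (the helper name is mine; headerFilePath and tempFilePath are the question's variables):

```python
import hashlib

def file_md5(path):
    # Hex digest of a file's contents, read in binary mode.
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Hypothetical usage, mirroring the question's logic:
# if file_md5(tempFilePath) == file_md5(headerFilePath):
#     os.remove(tempFilePath)                    # unchanged: discard new copy
# else:
#     shutil.move(tempFilePath, headerFilePath)  # changed: install it
```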
I have a text file /etc/default/foo which contains one line:
FOO="/path/to/foo"
In my python script, I need to reference the variable FOO.
What is the simplest way to "source" the file /etc/default/foo into my python script, same as I would do in bash?
. /etc/default/foo
Same answer as @jil's; however, that answer is specific to a historical version of Python.
In modern Python (3.x):
exec(open('filename').read())
replaces execfile('filename') from 2.x
You could use execfile:
execfile("/etc/default/foo")
But please be aware that this will evaluate the contents of the file as-is into your program source. It is a potential security hazard unless you can fully trust the source.
It also means that the file needs to be valid python syntax (your given example file is).
Keep in mind that if you have a "text" file with this content that has a .py as the file extension, you can always do:
import mytextfile
print(mytextfile.FOO)
Of course, this assumes that the text file is syntactically correct as far as Python is concerned. On a project I worked on we did something similar: turned some text files into Python files. Wacky, but maybe worth consideration.
Just to give a different approach, note that if your original file is setup as
export FOO=/path/to/foo
You can do source /etc/default/foo; python myprogram.py (or . /etc/default/foo; python myprogram.py) and within myprogram.py all the values that were exported in the sourced file are visible in os.environ, e.g.:
import os
os.environ["FOO"]
If you know for certain that it only contains VAR="QUOTED STRING" style variables, like this:
FOO="some value"
Then you can just do this:
>>> with open('foo.sysconfig') as fd:
... exec(fd.read())
Which gets you:
>>> FOO
'some value'
(This is effectively the same thing as the execfile() solution
suggested in the other answer.)
This method has substantial security implications; if instead of FOO="some value" your file contained:
os.system("rm -rf /")
Then you would be In Trouble.
Alternatively, you can do this:
>>> import shlex
>>> with open('foo.sysconfig') as fd:
...     settings = {var: shlex.split(value) for var, value in [line.split('=', 1) for line in fd]}
Which gets you a dictionary settings that has:
>>> settings
{'FOO': ['some value']}
That settings = {...} line is using a dictionary comprehension. You could accomplish the same thing in a few more lines with a for loop and so forth.
And of course if the file contains shell-style variable expansion like ${somevar:-value_if_not_set} then this isn't going to work (unless you write your very own shell style variable parser).
There are a couple ways to do this sort of thing.
You can indeed import the file as a module, as long as the data it contains corresponds to Python's syntax. But either the file in question is a .py in the same directory as your script, or you have to use imp (or importlib, depending on your version) as described here.
Another solution (that has my preference) can be to use a data format that any python library can parse (JSON comes to my mind as an example).
/etc/default/foo :
{"FOO":"path/to/foo"}
And in your Python code:
import json

with open('/etc/default/foo') as f:
    data = json.load(f)
FOO = data["FOO"]
## ...
This way, you don't risk executing some uncertain code...
You have the choice, depending on what you prefer. If your data file is auto-generated by some script, it might be easier to keep a simple syntax like FOO="path/to/foo" and use imp.
Hope that helps!
The Solution
Here is my approach: parse the bash file myself and process only variable assignment lines such as:
FOO="/path/to/foo"
Here is the code:
import shlex
def parse_shell_var(line):
    """
    Parse such lines as:
        FOO="My variable foo"

    :return: a tuple of var name and var value, such as
        ('FOO', 'My variable foo')
    """
    return shlex.split(line, posix=True)[0].split('=', 1)

if __name__ == '__main__':
    with open('shell_vars.sh') as f:
        shell_vars = dict(parse_shell_var(line) for line in f if '=' in line)
    print(shell_vars)
How It Works
Take a look at this snippet:
shell_vars = dict(parse_shell_var(line) for line in f if '=' in line)
This line iterates through the lines of the shell script, processing only those that contain an equal sign (not a fool-proof way to detect variable assignments, but the simplest). Next, it runs those lines through the function parse_shell_var, which uses shlex.split to correctly handle the quotes (or the lack thereof). Finally, the pieces are assembled into a dictionary. The output of this script is:
{'MOO': '/dont/have/a/cow', 'FOO': 'my variable foo', 'BAR': 'My variable bar'}
Here is the contents of shell_vars.sh:
FOO='my variable foo'
BAR="My variable bar"
MOO=/dont/have/a/cow
echo $FOO
Discussion
This approach has a couple of advantages:
It does not execute the shell (either in bash or in Python), which avoids any side-effect
Consequently, it is safe to use, even if the origin of the shell script is unknown
It correctly handles values with or without quotes
This approach is not perfect, it has a few limitations:
The method of detecting variable assignments (by looking for the presence of an equal sign) is primitive and not accurate. There are ways to detect these lines better, but that is a topic for another day.
It does not correctly parse values which are built from other variables or commands. That means it will fail for lines such as:
FOO=$BAR
FOO=$(pwd)
Based on the answer that uses exec(f.read()): if you use value = eval(f.read()) instead, it will return only the value of the expression in the file (which must therefore be a single expression). E.g.:
1 + 1: 2
"Hello World": "Hello World"
float(2) + 1: 3.0
First of all, thank you to everyone on Stack Overflow for past, present, and future help. You've all saved me from disaster (both of my own design and otherwise) too many times to count.
The present issue is part of a decision at my firm to transition from a Microsoft SQL Server 2005 database to PostgreSQL 9.4. We have been following the notes on the Postgres wiki (https://wiki.postgresql.org/wiki/Microsoft_SQL_Server_to_PostgreSQL_Migration_by_Ian_Harding), and these are the steps we're following for the table in question:
Download table data [on Windows client]:
bcp "Carbon.consensus.observations" out "Carbon.consensus.observations" -k -S [servername] -T -w
Copy to Postgres server [running CentOS 7]
Run Python pre-processing script on Postgres server to change encoding and clean:
import sys
import os
import re
import codecs
import fileinput
base_path = '/tmp/tables/'
cleaned_path = '/tmp/tables_processed/'
files = os.listdir(base_path)
for filename in files:
    source_path = base_path + filename
    temp_path = '/tmp/' + filename
    target_path = cleaned_path + filename
    BLOCKSIZE = 1048576 # or some other, desired size in bytes
    with open(source_path, 'r') as source_file:
        with open(target_path, 'w') as target_file:
            start = True
            while True:
                contents = source_file.read(BLOCKSIZE).decode('utf-16le')
                if not contents:
                    break
                if start:
                    if contents.startswith(codecs.BOM_UTF8.decode('utf-8')):
                        contents = contents.replace(codecs.BOM_UTF8.decode('utf-8'), ur'')
                contents = contents.replace(ur'\x80', u'')
                contents = re.sub(ur'\000', ur'', contents)
                contents = re.sub(ur'\r\n', ur'\n', contents)
                contents = re.sub(ur'\r', ur'\\r', contents)
                target_file.write(contents.encode('utf-8'))
                start = False
    for line in fileinput.input(target_path, inplace=1):
        if '\x80' in line:
            line = line.replace(r'\x80', '')
        sys.stdout.write(line)
Execute SQL to load table:
COPY consensus.observations FROM '/tmp/tables_processed/Carbon.consensus.observations';
The issue is that the COPY command is failing with a unicode error:
[2015-02-24 19:52:24] [22021] ERROR: invalid byte sequence for encoding "UTF8": 0x80
Where: COPY observations, line 2622420: "..."
Given that this could very likely be because of bad data in the table (which also contains legitimate non-ASCII characters), I'm trying to find the actual byte sequence in context, and I can't find it anywhere (sed to look at the line in question, regexes to replace the character as part of the preprocessing, etc). For reference, this grep returns nothing:
cat /tmp/tables_processed/Carbon.consensus.observations | grep --color='auto' -P "[\x80]"
What am I doing wrong in tracking down where this byte sequence sits in context?
I would recommend loading the SQL file (which appears to be /tmp/tables_processed/Carbon.consensus.observations) into an editor that has a hex mode. This should allow you to see it (depending on the exact editor) in context.
gVim (or terminal-based Vim) is one option I would recommend.
For example, if I open in gVim an SQL copy file that has this content:
1 1.2
2 1.1
3 3.2
I can then convert it into hex mode via the command %!xxd (in gVim or terminal Vim) or the menu option Tools > Convert to HEX.
That yields this display:
0000000: 3109 312e 320a 3209 312e 310a 3309 332e 1.1.2.2.1.1.3.3.
0000010: 320a 2.
You can then run %!xxd -r to convert it back, or the Menu option Tools > Convert back.
Note: This actually modifies the file, so it would be advisable to do this to a copy of the original, just in case the changes somehow get written (you would have to explicitly save the buffer in Vim).
This way, you can see both the hex sequences on the left, and their ASCII equivalent on the right. If you search for 80, you should be able to see it in context. With gVim, the line numbering will be different for both modes, though, as is evidenced by this example.
It's likely the first 80 you find will be that line, though, since if there were earlier ones, it likely would've failed on those instead.
Another tool which might help that I've used in the past is the graphical hex editor GHex. Since that's a GNOME project, not quite sure it'll work with CentOS. wxHexEditor supposedly works with CentOS and looks promising from the website, although I've not yet used it. It's pitched as a "hex editor for massive files", so if your SQL file is large, that might be the way to go.
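If you'd rather stay scriptable, here is a small Python sketch (mine, not from the answers above) that reports the byte offset and surrounding bytes of every 0x80 occurrence; the path in the usage comment is the one from the question:

```python
def find_byte(path, needle=0x80, context=20):
    # Scan the file as raw bytes and collect (offset, surrounding bytes)
    # for every occurrence of the needle byte.
    with open(path, "rb") as f:
        data = f.read()
    hits = []
    pos = data.find(bytes([needle]))
    while pos != -1:
        hits.append((pos, data[max(0, pos - context):pos + context]))
        pos = data.find(bytes([needle]), pos + 1)
    return hits

# Usage with the file from the question:
# for offset, snippet in find_byte("/tmp/tables_processed/Carbon.consensus.observations"):
#     print(offset, snippet)
```

Working on raw bytes sidesteps any encoding questions, which is exactly what you want when hunting an invalid byte sequence.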
I'm trying to verify that the content generated from wkhtmltopdf is the same from run to run, however every time I run wkhtmltopdf I get a different hash / checksum value against the same page. We are talking something real basic like using an html page of:
<html>
<body>
<p> This is some text</p>
</body>
</html>
I get a different md5 or sha256 hash every time I run wkhtmltopdf using an amazing line of:
./wkhtmltopdf example.html ~/Documents/a.pdf
And using a python hasher of:
def shasum(filename):
sha = hashlib.sha256()
with open(filename,'rb') as f:
for chunk in iter(lambda: f.read(128*sha.block_size), b''):
sha.update(chunk)
return sha.hexdigest()
or the md5 version which just swaps sha256 with md5
Why would wkhtmltopdf generate a different file enough to cause a different checksum, and is there any way to not do that? some command line that can be passed in to prevent this?
I've tried --default-header, --no-pdf-compression and --disable-smart-shrinking
This is on Mac OS X, but I've generated these PDFs on other machines and downloaded them, with the same result.
wkhtmltopdf version = 0.10.0 rc2
I tried this and opened the resulting PDF in emacs. wkhtmltopdf is embedding a "/CreationDate" field in the PDF. It will be different for every run, and will screw up the hash values between runs.
I didn't see an option to disable the "/CreationDate" field, but it would be simple to strip it out of the file before computing the hash.
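For instance, a sketch in Python along those lines (the regex assumes the /CreationDate (D:...) form that wkhtmltopdf appears to embed; verify it against your own output):

```python
import hashlib
import re

def pdf_hash_ignoring_creation_date(path):
    # Hash the PDF bytes with any /CreationDate entry blanked out,
    # so two runs over the same HTML should compare equal.
    with open(path, "rb") as f:
        data = f.read()
    data = re.sub(rb"/CreationDate \(D:[^)]*\)", b"", data)
    return hashlib.sha256(data).hexdigest()
```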
I wrote a method to copy the creation date from the expected output to the current generated file. It's in Ruby, and the arguments are any class that walks and quacks like IO:
def copy_wkhtmltopdf_creation_date(to, from)
  to_current_pos, from_current_pos = [to.pos, from.pos]
  to.pos = from.pos = 74
  to.write(from.read(14))
  to.pos, from.pos = [to_current_pos, from_current_pos]
end
I was inspired by Carlos to write a solution that doesn't use a hardcoded index, since in my documents the index differed from Carlos' 74.
Also, I don't have the files open already. And I handle the case of returning early when no CreationDate is found.
def copy_wkhtmltopdf_creation_date(to, from)
  index, date = File.foreach(from).reduce(0) do |acc, line|
    if line.index("CreationDate")
      break [acc + line.index(/\d{14}/), $~[0]]
    else
      acc + line.bytesize
    end
  end

  if date # IE, yes this is a wkhtmltopdf document
    File.open(to, "r+") do |to|
      to.pos = index
      to.write(date)
    end
  end
end
We solved the problem by stripping the creation date with a simple regex.
preg_replace("/\\/CreationDate \\(D:.*\\)\\n/uim", "", $file_contents, 1);
After doing this we can get a consistent checksum every time.
I have a script which connects to a database and gets all records which satisfy the query. These records are files present on a server, so now I have a text file which has all the file names in it.
I want a script which would know:
What is the size of each file in the output.txt file?
What is the total size of all the files present in that text file?
Update:
I would like to know how can I achieve my task using Perl programming language, any inputs would be highly appreciated.
Note: I do not have any specific language constraint; it could be either a Perl or a Python script which I can run from the Unix prompt. Currently I am using the bash shell and have an sh and a py script. How can this be done?
My scripts:
#!/usr/bin/ksh
export ORACLE_HOME=database specific details
export PATH=$ORACLE_HOME/bin:path information
sqlplus database server information<<EOF
SET HEADING OFF
SET ECHO OFF
SET PAGESIZE 0
SET LINESIZE 1000
SPOOL output.txt
select * from my table_name;
SPOOL OFF
EOF
I know du -h is the command I should be using, but I am not sure what my script should look like. I have tried something in Python; I am totally new to Python and this is my first attempt.
Here it is:
import os
folderpath='folder_path'
file=open('output file which has all listing of query result','r')
for line in file:
    filename=line.strip()
    filename=filename.replace(' ', '\ ')
    fullpath=folderpath+filename
    # print (fullpath)
    os.system('du -h '+fullpath)
File names in the output text file for example are like: 007_009_Bond Is Here_009_Yippie.doc
Any guidance would be highly appreciated.
Update:
How can I move all the files which are listed in the output.txt file to some other folder location using Perl?
After doing step 1, how can I delete all the files which are listed in the output.txt file?
Any suggestions would be highly appreciated.
In Perl, the -s filetest operator is probably what you want.
use strict;
use warnings;
use File::Copy;
my $folderpath = 'the_path';
my $destination = 'path/to/destination/directory';
open my $IN, '<', 'path/to/infile';
my $total;
while (<$IN>) {
    chomp;
    my $size = -s "$folderpath/$_";
    print "$_ => $size\n";
    $total += $size;
    move("$folderpath/$_", "$destination/$_") or die "Error when moving: $!";
}
print "Total => $total\n";
Note that -s gives size in bytes, not blocks like du.
On further investigation, Perl's -s is equivalent to du -b. You should probably read the man page of your specific du to make sure that you are actually measuring what you intend to measure.
If you really want the du values, change the assignment to $size above to:
my ($size) = split(' ', `du "$folderpath/$_"`);
Eyeballing, you can make YOUR script work this way:
1) Delete the line filename=filename.replace(' ', '\ '). Escaping is more complicated than that; you should just quote the full path or use a Python library to escape it based on the specific OS;
2) You are probably missing a delimiter between the path and the file name;
3) You need single quotes around the full path in the call to os.system.
This works for me:
#!/usr/bin/python
import os
folderpath='/Users/andrew/bin'
file=open('ft.txt','r')
for line in file:
    filename=line.strip()
    fullpath=folderpath+"/"+filename
    os.system('du -h '+"'"+fullpath+"'")
The file "ft.txt" has file names with no path and the path part is '/Users/andrew/bin'. Some of the files have names that would need to be escaped, but that is taken care of with the single quotes around the file name.
That will run du -h on each file in the .txt file, but does not give you the total. This is fairly easy in Perl or Python.
Here is a Python script (based on yours) to do that:
#!/usr/bin/python
import os
folderpath='/Users/andrew/bin/testdir'
file=open('/Users/andrew/bin/testdir/ft.txt','r')
blocks=0
i=0
template='%d total files in %d blocks using %d KB\n'
for line in file:
    i+=1
    filename=line.strip()
    fullpath=folderpath+"/"+filename
    if(os.path.exists(fullpath)):
        info=os.stat(fullpath)
        blocks+=info.st_blocks
        print `info.st_blocks`+"\t"+fullpath
    else:
        print '"'+fullpath+"'"+" not found"
print `blocks`+"\tTotal"
print " "+template % (i,blocks,blocks*512/1024)
Notice that you do not have to quote or escape the file name this time; Python does it for you. This calculates file sizes using allocation blocks, the same way that du does it. If I run du -ahc against the same files that I have listed in ft.txt I get the same number (well, kinda; du reports it as 25M and I get the report as 24324 KB), but it reports the same number of blocks. (Side note: "blocks" are always assumed to be 512 bytes under Unix, even though the actual block size on larger discs is always larger.)
Finally, you may want to consider making your script so that it can read a command line group of files rather than hard coding the file and the path in the script. Consider:
#!/usr/bin/python
import os, sys
total_blocks=0
total_files=0
template='%d total files in %d blocks using %d KB\n'
print
for arg in sys.argv[1:]:
    print "processing: "+arg
    blocks=0
    i=0
    file=open(arg,'r')
    for line in file:
        abspath=os.path.abspath(arg)
        folderpath=os.path.dirname(abspath)
        i+=1
        filename=line.strip()
        fullpath=folderpath+"/"+filename
        if(os.path.exists(fullpath)):
            info=os.stat(fullpath)
            blocks+=info.st_blocks
            print `info.st_blocks`+"\t"+fullpath
        else:
            print '"'+fullpath+"'"+" not found"
    print "\t"+template % (i,blocks,blocks*512/1024)
    total_blocks+=blocks
    total_files+=i
print template % (total_files,total_blocks,total_blocks*512/1024)
You can then execute the script (after chmod +x [script_name].py) with ./script.py ft.txt, and it will use the directory containing the command-line file (here ft.txt) as the assumed location of the listed files. You can process multiple listing files as well.
You can do it in your shell script itself.
You have all the file names in your spooled file output.txt; all you have to add at the end of the existing script is:
xargs -d '\n' du -ch < output.txt
(du does not read file names from standard input, so xargs turns each line of output.txt into an argument; -d '\n' keeps names containing spaces intact, and -c adds a grand total.) It will give the size of each file and also a total at the end.
You can use the Python skeleton that you've sketched out and add os.path.getsize(fullpath) to get the size of individual file.
For example, if you wanted a dictionary mapping each file name to its size you could (stripping the trailing newline from each line first):
dict((name, os.path.getsize(name)) for name in (line.strip() for line in file))
Keep in mind that the result from os.path.getsize(...) is in bytes, so you'll have to convert it if you want other units.
In general os.path is a key module for manipulating files and paths.
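As a sketch of how those pieces fit together (the function name is mine; the listing path and folder stand in for the question's output.txt and server directory):

```python
import os

def sizes_from_listing(listing_path, folder):
    # Map each file name listed (one per line) to its size in bytes,
    # skipping blank lines and missing files, and sum for a total.
    sizes = {}
    with open(listing_path) as listing:
        for line in listing:
            name = line.strip()
            if not name:
                continue
            full = os.path.join(folder, name)
            if os.path.exists(full):
                sizes[name] = os.path.getsize(full)
    return sizes, sum(sizes.values())
```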