I'm using hashlib in a script I'm writing, and I can't get it to take the files I'm pointing it at: it only returns the hash of an empty (0 KB) file, d41d8cd98f00b204e9800998ecf8427e.
I'm calling it like so: fHash=md5Checksum(f) where f is a full path string of a file.
The md5Checksum function is lifted from here: http://www.joelverhagen.com/blog/2011/02/md5-hash-of-file-in-python/ and I've tested the second example directly with an explicitly typed path to a file, and it returns the correct hash.
I am also using the os.path.getsize method in the same way (fSize=os.path.getsize(f)) and that is picking the file up correctly.
When I dump the contents of the f string in my code, and compare it to the explicitly typed path, I notice it lacks ' markers around the string:
/home/.../(500) Days of Summer[2009]/11 - Regina Spektor - Hero.mp3 from a 'print f'
and from the explicitly typed path:
print 'The MD5 checksum of text.txt is', md5Checksum('/home/.../deduper/test.txt') (which works)
If I manually add ' markers to the path, the code falls over:
IOError: [Errno 2] No such file or directory: "'/home/.../(500) Days of Summer[2009]/11 - Regina Spektor - Hero.mp3'"
This makes me suspect I'm not passing the path correctly. I'm on an Ubuntu box, if that matters.
EDIT
I'm a buffoon. I've been stuck on this for a few days, and it's only through posting it on here and checking the indentation that I've noticed I've messed one of the lines up in the md5Checksum method along the way... I've fixed it, and this totally works. Thank you all for making me check. (For the record, I had the m.update(data) line in line with the break. That's not going to work now, is it... :s)
def md5Checksum(filePath):
    fh = open(filePath, 'rb')
    m = hashlib.md5()
    # print "File being hashed: " + filePath
    while True:
        data = fh.read(8192)
        if not data:
            break
        m.update(data)
    return m.hexdigest()
I had somehow got the indentation misaligned: not enough to cause it to fall over and give me an error, but enough for it to not work. The answer is posted in the original question.
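For reference, the same chunked-read approach can be written with a `with` block so the file handle is always closed, even on error (a sketch; the function name and chunk size are just choices, not anything the original requires):

```python
import hashlib

def md5_checksum(file_path, chunk_size=8192):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    m = hashlib.md5()
    with open(file_path, 'rb') as fh:
        while True:
            data = fh.read(chunk_size)
            if not data:
                break
            m.update(data)  # must sit inside the loop, not aligned with the break
    return m.hexdigest()
```

An empty file (or a loop that never calls `update`) produces exactly the d41d8cd98f00b204e9800998ecf8427e digest from the question, which is the MD5 of zero bytes.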
It has been trial and error, and I can't seem to get what I want.
I am accessing an API to get some info. Unfortunately it's the only API to get that info and to do it, it downloads a binary content of a file and names it:
folder\filename.whatever
i.e. test\purpleMonkeyTest.docx
There is a bunch more info that comes in from the call but there is this line:
Saved the binary content to: /home/user/python/test\purpleMonkeyTest.docx
Some of the files have " or other special characters so I can't just get the file name and delete it as part of the script, since I won't know what to escape.
So my goal here is to strip the line and get:
/home/user/python/test\purpleMonkeyTest.docx
then get only:
/home/user/python/test\pu
then:
os.remove "/home/user/python/test\pu"*
I'm thinking a wildcard should work for all of them, unless there is a better way to do it. All the saved files have the character \ in them, so I've got to the point where I'm getting everything prior to the \, but I want one or two characters after it as well.
Here's what I've tried:
def fileName(itemID):
    import fnmatch
    details = itemDetails(itemID, True)  # get item id and file details
    filepath = matchPattern(details, 'Saved the binary content to: *')
    filepath = filepath.split('\\')[0]
    print(filepath)
    #os.remove(re.escape(filepath))
    return matchPattern(details, 'Binary component: *')

def matchPattern(details, pattern):
    import fnmatch
    return fnmatch.filter(details, pattern)[0].split(": ", 1)[1]
Output:
/home/user/python/test
purpleMonkeyTest.docx
I do want the file name for later: that's actually the main goal. The API downloads the damn file automatically though.
EDIT:
The answer below works for getting the chars I want. os.remove is not removing the file, though:
OSError: [Errno 2] No such file or directory: '/home/user/python/test\\Re*'
Managed to get it to work using glob; I guess os.remove doesn't support wildcards.
files = glob.glob(filepath + "*")
for file in files:
    os.remove(file)
Thanks for the help!!
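os.remove takes one literal path and does no wildcard expansion, which is why the glob step is needed. A small self-contained sketch of the same pattern (the function name is mine; glob.escape, available since Python 3.4, guards against bracket or star characters appearing in the prefix itself, which matters for filenames with special characters like those in the question):

```python
import glob
import os

def remove_matching(prefix):
    """Delete every file whose path starts with `prefix`.

    glob does the wildcard expansion; os.remove then gets one
    literal, already-resolved path at a time.
    """
    for path in glob.glob(glob.escape(prefix) + "*"):
        os.remove(path)
```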
As far as I understand your question, you would like to retrieve two parts: everything between the first / and the \ plus two characters afterwards, and then everything after the \:
str = "Saved the binary content to: /home/user/python/test\purpleMonkeyTest.docx"
print (str[str.index("/"):str.rindex("\\") + 3])
print (str[str.rindex("\\") + 1:])
Output
/home/user/python/test\pu
purpleMonkeyTest.docx
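The same two pieces can also be pulled out with str.partition, which avoids computing indexes by hand (just a sketch of an alternative, using the same sample string):

```python
s = "Saved the binary content to: /home/user/python/test\\purpleMonkeyTest.docx"

path = s[s.index("/"):]                   # everything from the first slash on
prefix, sep, name = path.partition("\\")  # split once on the single backslash
print(prefix + sep + name[:2])  # -> /home/user/python/test\pu
print(name)                     # -> purpleMonkeyTest.docx
```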
For every input file processed (see code below) I am trying to use "os.path.basename" to write to a new output file - I know I am missing something obvious...?
import os
import glob
import gzip
dbpath = '/home/university/Desktop/test'
for infile in glob.glob(os.path.join(dbpath, 'G[D|E]/????/*.gz')):
    print("current file is: " + infile)

    outfile = os.path.basename('/home/university/Desktop/test/G[D|E]/????/??????.xaa.fastq.gz').rsplit('.xaa.fastq.gz')[0]
    file = open(outfile, 'w+')

    gzsuppl = Chem.ForwardSDMolSupplier(gzip.open(infile))
    for m in gzsuppl:
        if m is None: continue
        # ...etc
    file.close()
print(count)
It is not clear to me how to capture the variable [0] (i.e. everything upstream of .xaa.fastq.gz) and use it as the basename for the new output file.
Unfortunately it simply writes the new output file as "??????" rather than the actual sequence of six letters.
Thanks for any help given.
This seems like it will get everything upstream of the .xaa.fastq.gz in the paths returned from glob() in your sample code:
import os

filepath = '/home/university/Desktop/test/GD /AAML/DEAAML.xaa.fastq.gz'
filepath = os.path.normpath(filepath)  # Changes path separators for Windows.

# This section was adapted from answer https://stackoverflow.com/a/3167684/355230
folders = []
while 1:
    filepath, folder = os.path.split(filepath)
    if folder:
        folders.append(folder)
    else:
        if filepath:
            folders.append(filepath)
        break
folders.reverse()

if len(folders) > 1:
    # The last element of folders should contain the original filename.
    filename_prefix = os.path.basename(folders[-1]).split('.')[0]
    outfile = os.path.join(*(folders[:-1] + [filename_prefix + '.rest_of_filename']))
    print(outfile)  # -> \home\university\Desktop\test\GD \AAML\DEAAML.rest_of_filename
Of course what ends up in outfile isn't the final path plus filename, since I don't know what the remainder of the filename will be and just put a placeholder in (the '.rest_of_filename').
I'm not familiar with the kind of input data you're working with, but here's what I can tell you:
The "something obvious" you're missing is that outfile has no connection to infile. Your outfile line uses the ?????? rather than the actual filename because that's what you ask for; it's glob.glob that turns the pattern into a list of matches.
Here's how I'd write that aspect of the outfile line:
outfile = infile.rsplit('.xaa.fastq.gz', 1)[0]
(The , 1 ensures that it'll never split more than once, no matter how crazy a filename gets. It's just a good habit to get into when using split or rsplit like this.)
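The effect of the maxsplit argument is easy to see with a deliberately pathological filename that contains the suffix twice (an illustrative example only):

```python
# With maxsplit=1, only the right-most occurrence is split off, so the
# rest of the name survives intact; without it, every occurrence splits.
weird = "sample.xaa.fastq.gz.backup.xaa.fastq.gz"
print(weird.rsplit(".xaa.fastq.gz", 1)[0])  # -> sample.xaa.fastq.gz.backup
print(weird.rsplit(".xaa.fastq.gz")[0])     # -> sample
```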
You're setting yourself up for a bug: the glob pattern can match *.gz files which don't end in .xaa.fastq.gz, so a random .gz file that happens to wind up in the folder would cause outfile to have the same path as infile, and you'd end up writing to the input file.
There are three solutions to this problem which apply to your use case:
Use *.xaa.fastq.gz instead of *.gz in your glob. I don't recommend this because it's easy for a typo to sneak in and make them different again, which would silently reintroduce the bug.
Write your output to a different folder than you took your input from.
outfile = os.path.join(outpath, os.path.relpath(infile, dbpath))
outparent = os.path.dirname(outfile)
if not os.path.exists(outparent):
    os.makedirs(outparent)
Add an assert outfile != infile line so the program will die with a meaningful error message in the "this should never actually happen" case, rather than silently doing incorrect things.
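That last guard costs one line. A minimal sketch (the helper name and the hard-coded suffix are mine, for illustration):

```python
def safe_outfile(inpath, suffix=".xaa.fastq.gz"):
    """Strip `suffix` to build the output name, refusing to return a path
    identical to the input (which would mean clobbering the input file)."""
    outpath = inpath.rsplit(suffix, 1)[0]
    assert outpath != inpath, "refusing to overwrite input file: %s" % inpath
    return outpath
```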
The indentation of what you posted could be wrong, but it looks like you're opening a bunch of files, then only closing the last one. My advice is to use this instead, so it's impossible to get that wrong:
with open(outfile, 'w+') as file:
    # put things which use `file` here
The name file is already present in the standard library, and the variable names you chose are unhelpful. I'd rename infile to inpath, outfile to outpath, and file to outfile. That way, you can tell whether each one is a path (i.e. a string) or a Python file object just from the variable name, and there's no risk of accessing file before you (re)define it and getting a very confusing error message.
I came across this problem while trying to decompress a zip file.
-- zipfile.is_zipfile(my_file) always returns False, even though the UNIX command unzip handles it just fine. Also, when trying to do zipfile.ZipFile(path/file_handle_to_path) I get the same error
-- the file command returns Zip archive data, at least v2.0 to extract, and using less on the file shows:
PKZIP for iSeries by PKWARE
Length Method Size Cmpr Date Time CRC-32 Name
2113482674 Defl:S 204502989 90% 2010-11-01 08:39 2cee662e myfile.txt
2113482674 204502989 90% 1 file
Any ideas how I can get around this issue? It would be nice if I could make Python's zipfile work, since I already have some unit tests that I'll have to drop if I switch to running subprocess.call("unzip")
Ran into the same issue on my files and was able to solve it. I'm not sure how they were generated, like in the above example. They all had trailing data at the end, ignored by both Windows and 7z but failing Python's zipfile.
This is the code to solve the issue:
This is the code to solve the issue (adapted to stand alone: the original was lifted from a class, so its self._log call is replaced with a print, and the empty else branch now raises):

def fixBadZipfile(zipFile):
    f = open(zipFile, 'r+b')
    data = f.read()
    pos = data.find(b'\x50\x4b\x05\x06')  # End of central directory signature
    if pos > 0:
        print("Truncating file at location " + str(pos + 22) + ".")
        f.seek(pos + 22)  # size of 'ZIP end of central directory record'
        f.truncate()
        f.close()
    else:
        # The signature was not found: the file is truncated or not a zip.
        raise zipfile.BadZipfile('File is not a zip file')
You say using less on the file it shows such and such. Do you mean this?
less my_file
If so, I would guess these are comments that the zip program put in the
file. Looking at a user guide for the iSeries PKZIP I found on the web,
this appears to be the default behavior.
The docs for zipfile say "This module does not currently handle ZIP
files which have appended comments." Perhaps this is the problem? (Of
course, if less shows them, this would seem to imply that they're
prepended, FWIW.)
It appears you (or whoever created the zipfile on an iSeries machine)
can turn this off with ARCHTEXT(*NONE), or use ARCHTEXT(*CLEAR) to
remove it from an existing zipfile.
# Utilize mmap module to avoid a potential DoS exploit (e.g. by reading the
# whole zip file into memory). A bad zip file example can be found here:
# https://bugs.python.org/issue24621
import mmap
from io import UnsupportedOperation
from zipfile import BadZipfile

# The end of central directory signature
CENTRAL_DIRECTORY_SIGNATURE = b'\x50\x4b\x05\x06'

def repair_central_directory(zipFile):
    if hasattr(zipFile, 'read'):
        # This is a file-like object
        f = zipFile
        try:
            fileno = f.fileno()
        except UnsupportedOperation:
            # This is an io.BytesIO instance which lacks a backing file.
            fileno = None
    else:
        # Otherwise, open the file in binary mode
        f = open(zipFile, 'rb+')
        fileno = f.fileno()
    if fileno is None:
        # Without a fileno, we can only read and search the whole string
        # for the end of central directory signature.
        f.seek(0)
        pos = f.read().find(CENTRAL_DIRECTORY_SIGNATURE)
    else:
        # Instead of reading the entire file into memory, memory-map the
        # file, then search it for the end of central directory signature.
        # Reference: https://stackoverflow.com/a/21844624/2293304
        mm = mmap.mmap(fileno, 0)
        pos = mm.find(CENTRAL_DIRECTORY_SIGNATURE)
        mm.close()
    if pos > -1:
        # size of 'ZIP end of central directory record'
        f.truncate(pos + 22)
        f.seek(0)
        return f
    else:
        # Raise an error to make it fail fast
        raise BadZipfile('File is not a zip file')
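The same truncation idea can be exercised end to end on an in-memory archive (a self-contained sketch, assuming the archive carries no trailing zip comment, so the EOCD record is exactly 22 bytes; rfind is used so an earlier chance occurrence of the signature in compressed data can't fool it):

```python
import io
import zipfile

EOCD = b'\x50\x4b\x05\x06'  # end-of-central-directory signature

def strip_trailing_junk(data):
    """Return the zip bytes truncated just past the EOCD record."""
    pos = data.rfind(EOCD)
    if pos < 0:
        raise zipfile.BadZipFile('File is not a zip file')
    return data[:pos + 22]

# Build a tiny archive in memory, append garbage, then repair it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as zf:
    zf.writestr('myfile.txt', 'hello')
broken = buf.getvalue() + b'trailing junk from a bad writer'

fixed = strip_trailing_junk(broken)
with zipfile.ZipFile(io.BytesIO(fixed)) as zf:
    print(zf.read('myfile.txt'))  # -> b'hello'
```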
I'm trying to replace a string in all the files within the current directory. For some reason, my temp file ends up blank. It seems my .write isn't working, maybe because secondfile was declared outside its scope? I'm new to Python, so still climbing the learning curve... thanks!
Edit: I'm aware my tempfile isn't being copied currently. I'm also aware there are much more efficient ways of doing this. I'm doing it this way for practice. If someone could answer specifically why the .write method fails to work here, that would be great. Thanks!
import os
import shutil

for filename in os.listdir("."):
    file1 = open(filename, 'r')
    secondfile = open("temp.out", 'w')
    print filename
    for line in file1:
        line2 = line.replace('mrddb2.', 'shpdb2.')
        line3 = line2.replace('MRDDB2.', 'SHPDB2.')
        secondfile.write(line3)
    print 'file copy in progress'
    file1.close()
    secondfile.close()
Just glancing at the thing, it appears that your problem is with the 'w': it looks like you keep overwriting, not appending. So you're basically looping through the file(s), and by the end you've only copied the last file to your temp file. You may want to open the file with 'a' instead of 'w'.
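The difference between the two modes is easy to demonstrate with a throwaway file (a minimal sketch; the words written are arbitrary):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "temp.out")

# Mode 'w' truncates on every open, so only the last write survives.
for word in ("first", "second"):
    with open(path, "w") as f:
        f.write(word)
print(open(path).read())  # -> second

# Mode 'a' appends, so every write survives.
for word in ("first", "second"):
    with open(path, "a") as f:
        f.write(word)
print(open(path).read())  # -> secondfirstsecond
```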
Your code (correctly indented, though I don't think there's a way to indent it so it runs but doesn't work right) actually seems right. Keep in mind, temp.out will be the replaced contents of only the last source file. Could it be that file is just blank?
Firstly,
you have forgotten to copy the temp file back onto the original.
Secondly:
use sed -i or perl -i instead of python.
For instance:
perl -i -pe 's/mrddb2/shpdb2/;s/MRDDB2/SHPDB2/' *
I don't have the exact answer for you, but what might help is to stick some print lines in there in strategic places: print each line before it is modified, then again after, and once more just before it is written to the file. Then, just before you close the new file, do a:
print secondfile.read()
You could also try to limit the results you get if there are too many for debugging purposes. You can limit string output by attaching a subscript modifier to the end, for example:
print secondfile.read()[:n]
If n = 100 it will limit the output to 100 characters.
If your code is actually indented as shown in the post, the write is working fine. If it is failing, the write call may be outside the inner for loop.
Just to make sure I wasn't really missing something, I tested the code and it worked fine for me. Maybe you could try continue for everything but one specific filename and then check the contents of temp.out after that.
import os

for filename in os.listdir("."):
    if filename != 'findme.txt': continue
    print 'Processing', filename
    file1 = open(filename, 'r')
    secondfile = open("temp.out", 'w')
    print filename
    for line in file1:
        line2 = line.replace('mrddb2.', 'shpdb2.')
        line3 = line2.replace('MRDDB2.', 'SHPDB2.')
        print 'About to write:', line3
        secondfile.write(line3)
    print 'Done with', filename
    file1.close()
    secondfile.close()
Also, as others have mentioned, you're just clobbering your temp.out file each time you process a new file. You've also imported shutil without actually doing anything with it. Are you forgetting to copy temp.out back to your original file?
I noticed it sometimes will not print to the file if you don't have a file.close() after file.write().
For example, this program never actually saves to the file; it just makes a blank file (unless you add outfile.close() right after the outfile.write):
outfile = open("ok.txt", "w")
fc = "filecontents"
outfile.write(fc.encode("utf-8"))
while 1:
    print "working..."
OP, you might also want to try the fileinput module (this way, you don't have to use your own temp file):
import os
import fileinput

for filename in os.listdir("."):
    for line in fileinput.FileInput(filename, inplace=1):
        line = line.strip().replace('mrddb2.', 'shpdb2.')
        line = line.strip().replace('MRDDB2.', 'SHPDB2.')
        print line
Set "inplace" to 1 to edit the file in place; set it to 0 for a normal print to stdout.
I am facing a problem with files containing huge data, and I need to skip doing some execution on those files.
I get the data of the file into a variable. Now I need to get the size of the variable in bytes, and if it is greater than 102400, print a message.
Update: I cannot open the files, since they are present in a tar file. The content is already getting copied to a variable called 'data'. I am able to print the contents of data; I just need to check whether it has more than 102400 bytes.
Thanks
import os

length_in_bytes = os.stat('file.txt').st_size
if length_in_bytes > 102400:
    print "It's a big file!"
Update to work on files in a tarfile:
import tarfile

tf = tarfile.TarFile('foo.tar')
for member in tf.getmembers():
    if member.size > 102400:
        print "It's a big file in a tarfile - the file is called %s!" % member.name
Just check the length of the string, then:
if len(data) > 102400:
    print "Skipping file which is too large, at %d bytes" % len(data)
else:
    process(data)  # The normal processing
If I'm understanding the question correctly, you want to skip certain input files if they're too large. For that, you can use os.path.getsize() on the path held in f:
import os.path

if os.path.getsize(f) <= 102400:
    doit()
len(data) gives you the size in bytes if it's binary data. With strings the size depends on the encoding used.
This answer seems irrelevant, since I seem to have misunderstood the question, which has now been clarified. However, should someone find this question, while searching with pretty much the same terms, this answer may still be relevant:
Just open the file in binary mode
f = open(filename, 'rb')
read/skip a bunch and print the next byte(s). I used the same method to 'fix' the n-th byte in a zillion images once.
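The seek-then-patch idea looks roughly like this (sketched against an in-memory buffer for demonstration; a real file opened in 'r+b' mode works the same way):

```python
import io

def patch_byte(fh, offset, value):
    """Overwrite the single byte at `offset` in a binary file-like object."""
    fh.seek(offset)
    fh.write(bytes([value]))

buf = io.BytesIO(b"hello world")
patch_byte(buf, 0, ord("H"))  # 'fix' the first byte
print(buf.getvalue())  # -> b'Hello world'
```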