Unique file name generation out of a couple of variables in Python

I am running some time-consuming calculations in Python and I would like to serialize the intermediate results.
My problem is the following: each calculation is configured by multiple parameters: a couple of numbers and strings. Of course, I could concatenate everything, but the result would be an extremely long string and I am afraid it would exceed the allowed length of a file name.
Any ideas how to cope with this?

An easy way would be to use md5 (e.g. https://docs.python.org/2/library/md5.html):
import md5
tmp = md5.new()
tmp.update(str(parameter1))  # feed each parameter in turn, converted to a string
tmp.update(str(parameter2))
# ... one update() call per parameter
filename = tmp.hexdigest()
This should generate filenames that are unique enough. You can add the current timestamp as a parameter to reduce the chance of collisions further.
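On Python 3, where the standalone md5 module no longer exists, a minimal sketch of the same idea using hashlib might look like this (the parameter values and the .pkl extension are placeholders, not part of the original answer):
import hashlib

# Placeholders for the calculation's configuration parameters.
params = [0.5, 200, "solver-A", "dataset-B"]

# Hash a deterministic string form of the parameters into a short, filesystem-safe name.
digest = hashlib.md5(repr(params).encode("utf-8")).hexdigest()
filename = "result_{}.pkl".format(digest)   # e.g. "result_3e2a...f1.pkl"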

Related

How to calculate same hash for two similar files?

I want to apply a hashing algorithm where the hash is the same if two files are similar. If one bit is lost, the hash of a file changes. Is there any algorithm I can apply in Python to tackle this problem?
Thank you.
I heard block hashing does this, but I don't know how to apply it.
I applied the following algorithm, but it does not help:
import hashlib

file = "Annotation 2020-04-09 163448.png"  # Location of the file (can be set a different way)
BLOCK_SIZE = 65536  # The size of each read from the file

file_hash = hashlib.sha256()  # Create the hash object; can use something other than `.sha256()` if you wish
with open(file, 'rb') as f:  # Open the file to read its bytes
    fb = f.read(BLOCK_SIZE)  # Read from the file. Take in the amount declared above
    while len(fb) > 0:  # While there is still data being read from the file
        file_hash.update(fb)  # Update the hash
        fb = f.read(BLOCK_SIZE)  # Read the next block from the file

print(file_hash.hexdigest())  # Get the hexadecimal digest of the hash
The entire point of cryptographic hashing algorithms is that the hash becomes completely different if even one bit of the source file differs, to ensure that generating hash collisions is hard. Here are some workarounds:
The only robust way to find files that are "similar" but not identical is to compare the entire file contents pairwise and compute a similarity score. This is rather inefficient, however, since it is an O(n^2) algorithm over the number of files with frequent hard-drive round trips.
Another way is to hash only part of each file. This has the same problem: if even one bit of that part differs, the hash differs. However, you may be able to ignore spaces, markup, or capitalization, hash only the file headers, or drop the last few bits of every color value; there are plenty of options for removing less relevant data to create less precise hashes. You could use block hashing here as a small optimization to avoid repeatedly loading big files, first checking whether enough blocks are similar.
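As an illustration of the "ignore spaces or capitalization" idea, here is a minimal sketch for text files (the function name is made up for this example):
import hashlib

def normalized_hash(path):
    # Hash a text file after stripping surrounding whitespace and lowercasing
    # each line. Files that differ only in spacing or capitalization get the
    # same digest; any other change still produces a completely different one.
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for line in f:
            h.update(line.strip().lower())
    return h.hexdigest()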
You can also combine these techniques: use a hash to quickly check whether at least the basic file metadata matches, and then run a slower algorithm to compute a similarity score only if the hashes match. This combines some of the accuracy of the first approach with some of the speed of the second, though neither the accuracy nor the speed will be great.
The final option is to use a very weak hashing algorithm. If you just use sum(file) % 2**32, similar files will give somewhat similar hashes in some cases, but it is really hard to determine actual similarity from the final value: a single changed byte anywhere in the file still shifts the hash, and if you treat all hashes within 256 of each other as "similar", many unrelated files will match while files that differ in two or more bytes are missed.
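For reference, that weak byte-sum hash could be written as follows (illustrative only; as explained, it is a very crude similarity signal):
def weak_sum_hash(path):
    # Sum every byte of the file modulo 2**32. A single changed byte shifts
    # the sum by at most 255, but unrelated files can land just as close.
    with open(path, 'rb') as f:
        return sum(f.read()) % (2 ** 32)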
Which of these techniques works for you depends on your use case, but beware that this is not an easy task. Good luck!

How to generate the shortest possible (alpha)numeric unique ID out of a file path?

I would like to generate a numeric or alphanumeric (whichever is easier) unique ID as a function of a file path in Python. I am working on a file-parsing application, and there is a file entity in the DB with descendants; in order to have a more compact foreign/primary key than the fully qualified path to a file, I would like to convert the path into the shortest possible unique digest.
What are my options to do this? Can I use SHA?
How about if I just took a checksum of the fully qualified path string and got something like 1736622845? On the command line, it can be done with:
echo -n '/my/path/filename' | cksum | cut -d' ' -f1
Is that guaranteed to never repeat for two different inputs? If yes, how would I translate the above bash piped command into pure Python so that I don't have to invoke a system call but get the same value?
The shortest possible unique ID of a string is the string itself.
You can try to use an alphabet that only contains the characters allowed in the path, so that you use fewer bits (a lot of work for not a lot of benefit, unless your paths really only contain a few characters).
What I think you want is a fairly good short hash function. As soon as you use a hash, there is a risk of collision. For most hash functions a good rule of thumb is to have far fewer entries than the size of the hash value space. The birthday paradox shows that once you have on the order of sqrt(key_space) entries, even the best hashes will produce a collision about half the time.
So if you take, say, 1000 paths, you should aim at having a hash space of at least 1,000,000 values to work with. You can also chop up other hash functions (say, take only the first 2 bytes of the md5). That should work, but note the increase in collisions (where 2 entries generate the same value).
Also, if you are so keen to save space, store the hash value in binary (a large int). It is far shorter than the usual encodings (base64 or hex), and all the DB functions should work fine.
So say you take md5 and store it as a large int: it takes only 16 bytes to store. But you can also use only the first 8 or 4 bytes (I wouldn't dare go lower than that).
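A minimal sketch of that suggestion (the function name is made up; fewer bytes means a shorter key but a higher collision risk):
import hashlib

def short_id(path, nbytes=8):
    # Take the md5 of the path and keep only the first `nbytes` bytes,
    # returned as an integer suitable for a compact DB key.
    digest = hashlib.md5(path.encode('utf-8')).digest()
    return int.from_bytes(digest[:nbytes], 'big')

print(short_id('/my/path/filename'))     # full 8-byte key
print(short_id('/my/path/filename', 4))  # 4 bytes: shorter, more collisions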

Hashing strings for load balancing

There's a great deal of information I can find on hashing strings for obfuscation or lookup tables, where collision avoidance is a primary concern. I'm trying to put together a hashing function for the purpose of load balancing, where I want to fit an unknown set of strings into an arbitrarily small number of buckets with a relatively even distribution. Collisions are expected (desired, even).
My immediate use case is load distribution in an application, where I want each instance of the application to fire at a different time of the half-hour, without needing any state information about other instances. So I'm trying to hash strings into integer values from 0 to 29. However, the general approach has wider application with different int ranges for different purposes.
Can anyone make suggestions, or point me to docs that would cover this little corner of hash generation?
My language of choice for this is Python, but I can read most common languages, so anything should be applicable.
You might consider something simple, like the adler32() algorithm, and just mod by the bucket count.
import zlib

buf = 'arbitrary and unknown string'
bucket = zlib.adler32(buf.encode('utf-8')) % 30  # adler32 expects bytes, so encode the string
# at this point bucket is in the range 0 - 29
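For the half-hour staggering use case described above, a hedged usage sketch might derive each instance's offset from its own hostname so no shared state is needed:
import socket
import zlib

# Illustrative only: each instance picks a minute (0-29) within the half-hour
# based on its hostname, with no coordination between instances.
minute_offset = zlib.adler32(socket.gethostname().encode('utf-8')) % 30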

Cache results of a time-intensive operation

I have a program (PatchDock), which takes its input from a parameters file, and produces an output file. Running this program is time-intensive, and I'd like to cache results of past runs so that I need not run the same parameters twice.
I'm able to parse the input and output files into appropriate data structures. The input file, for example, is parsed into a dictionary-like object. The input keys are all strings, and the values are primitive data types (ints, strings, and floats).
My approach
My first idea was to use the md5 hash of the input file as the key in a shelve database. However, this fails to match cached runs whose inputs are effectively the same but whose input files differ slightly (comments, spacing, order of parameters, et cetera).
Hashing the parsed parameters seems like the best approach to me. But the only way I can think of to get a unique hash from a dictionary is to hash a sorted string representation.
Question
Hashing a string representation of a parameters dictionary seems like a roundabout way of achieving my end goal: keying unique input values to a specified output. Is there a more straightforward way to achieve this caching system?
Ideally, I'm looking to achieve this in Python.
Hashing a sorted representation of the parsed input is actually the most straightforward way of doing this, and the one that makes sense. Your instincts were correct.
Basically, you're normalizing the input (by parsing it and sorting it), and then using that to construct a hash key.
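A minimal sketch of that normalization step (names are illustrative; it assumes the parsed parameters are JSON-serializable, which holds for the ints, strings, and floats described above):
import hashlib
import json

def cache_key(params):
    # Normalize the parsed parameter dict by sorting its keys, then hash the
    # canonical JSON text. Comments, spacing, and parameter order in the
    # original input file no longer affect the key.
    canonical = json.dumps(params, sort_keys=True)
    return hashlib.md5(canonical.encode('utf-8')).hexdigest()

# e.g. with a shelve database:  db[cache_key(params)] = output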
Hashing seems a very viable way, but doing this yourself seems a bit overkill. Why not use a tuple of the inputs as the key for your dictionary? You wouldn't have to worry about hashing and possible collisions yourself. All you have to do is fix an order for the keyword arguments (and, depending on your requirements, add a flag object for keywords that are not set).
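A sketch of the tuple-as-key idea (run_patchdock here is a hypothetical stand-in for the real expensive call):
cache = {}

def cached_run(params):
    # Fix an ordering for the parameters so equivalent inputs map to the
    # same tuple, then use that tuple directly as the dictionary key.
    key = tuple(sorted(params.items()))
    if key not in cache:
        cache[key] = run_patchdock(params)   # hypothetical expensive call
    return cache[key]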
You might also find functools.lru_cache useful if you are using Python 3.2+.
This is a decorator that caches the results of the most recent n calls to the decorated function.
If you are using an older version, there are backports of this functionality out there.
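For example, a minimal illustration of the decorator (the function here is a toy stand-in for the real computation; note that the arguments must be hashable, e.g. a tuple of sorted parameter items):
from functools import lru_cache

@lru_cache(maxsize=32)
def expensive(x, y):
    print("computing", x, y)   # runs only on a cache miss
    return x ** y

expensive(2, 10)   # computed and cached
expensive(2, 10)   # served from the cache; no "computing" printed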
There also seems to be a project with similar goals called FileDict which might be worth looking at.

How to design a memory and computationally intensive program to run on Google App Engine

I have a problem with my code running on Google App Engine. I don't know how to modify my code to suit GAE. The following is my problem:
for j in range(n):
    for d in range(j):
        for d1 in range(d):
            for d2 in range(d1):
                # block which runs in O(n^2)
Effectively the entire code block is O(N^6), and it runs for more than 10 minutes depending on n. That is why I am using task queues. I also need a 4-dimensional array of size n x n x n x n, stored as a nested list (e.g. A[j][d][d1][d2]), i.e. it needs O(N^4) memory.
Since put() is limited to 10 MB, I can't store the entire array. So I tried chopping it into smaller chunks, storing them, and combining them again on retrieval. I used JSON for this, but it doesn't work for larger n (> 40).
Then I stored the whole matrix as individual list entities in the datastore, i.e. one entity per A[j][d][d1], so there is no large local variable. When I access A[j][d][d1][d2] in my code, I call my own getitem and putitem functions to get and put data from the datastore (with caching as well). As a result, my code takes more time for computation. After a few iterations, GAE raises error 203 and the task fails with code 500.
I know that my code may not be best suited for GAE. But what is the best way to implement it on GAE?
There may be even more efficient ways to store your data and to iterate over it.
Questions:
What data type are you storing: a list of lists ... of ints?
What range of the nested list does your innermost O(n^2) loop portion typically operate over?
When you do putitem / getitem, how many values are you storing or retrieving in a single put or get?
Ideas:
You could try compressing your JSON (and base64-encoding it so it can be cut and pasted). In Python 2: 'myjson'.encode('zlib').encode('base64'); see the sketch after this list for a spelling that also works on Python 3.
Use a divide-and-conquer (map-reduce) approach as #Robert suggested. You may be able to use a dictionary with tuples as keys; this may mean fewer lookups than A[j][d][d1][d2] in your inner loop, and it also lets you populate your structure sparsely. You would need to track the bounds of the data you have loaded in another way. A[j][d][d1][d2] becomes D[(j,d,d1,d2)] or D[j,d,d1,d2].
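A hedged sketch of both ideas above (the variable names are illustrative, and the base64 step is only needed if you want a cut-and-pasteable string):
import base64
import json
import zlib

# Sparse 4-D structure: a dict keyed by index tuples instead of nested lists.
D = {}
D[1, 2, 3, 4] = 42          # equivalent to D[(1, 2, 3, 4)] = 42

# Compress-then-encode, the Python 3 spelling of
# 'myjson'.encode('zlib').encode('base64').
payload = json.dumps({repr(k): v for k, v in D.items()})
printable = base64.b64encode(zlib.compress(payload.encode('utf-8'))).decode('ascii')

# Reversing the transformation (keys come back as strings; convert as needed).
restored = json.loads(zlib.decompress(base64.b64decode(printable)).decode('utf-8'))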
You've omitted important details like the expected size of n from your question. Also, does the "# block which runs in O(n^2)" need access to the entire matrix, or are you simply populating the matrix based on the index values?
Here is a general answer: you need to find a way to break this up into smaller chunks. Maybe you can use some type of divide and conquer strategy and use tasks for parallelism. How you store your matrix depends on how you split the problem up. You might be able to store submatrices, or perhaps subvectors using the index values as key-names; again, this will depend on your problem and the strategy you use.
An alternative, if for some reason you cannot figure out how to parallelize your algorithm, is to use a continuation strategy of some kind. In other words, figure out roughly how many iterations you can typically do within the time constraints (leaving a safety margin); then, once you hit that limit, save your data and insert a new task to continue the processing. You'll just need to pass in the starting position and resume running from there. You may be able to do this easily by giving a starting parameter to the outermost range, but again it depends on the specifics of your problem.
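As a rough illustration of the continuation idea (no App Engine APIs shown; compute_row is a hypothetical stand-in for the real inner work):
def run_chunk(state, n, budget=1000):
    # Run at most `budget` outer iterations, record where to resume, and
    # report whether the whole computation is finished. The caller persists
    # `state` and enqueues a new task to continue if not done.
    j = state.get('next_j', 0)
    done = 0
    while j < n and done < budget:
        compute_row(j, state)   # hypothetical: fills in the results for row j
        j += 1
        done += 1
    state['next_j'] = j
    return j >= n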
Sam, just to give you an idea and a pointer on where to start:
If what you need is somewhere between storing the whole matrix and storing the numbers one by one, you may be interested in using pickle to serialize your lists and store them in the datastore for later retrieval.
A list is a Python object, and you should be able to serialize it.
http://appengine-cookbook.appspot.com/recipe/how-to-put-any-python-object-in-a-datastore
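A minimal sketch of the pickle route (sub_list is a placeholder for, say, one A[j][d][d1] slice):
import pickle

sub_list = [[0.0] * 4 for _ in range(4)]   # placeholder for one A[j][d][d1] slice

# Serialize to bytes that can go into a datastore blob property, and back.
blob = pickle.dumps(sub_list, protocol=pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)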
