SHA Hashing for training/validation/testing set split

SHA Hashing for training/validation/testing set split - python

Following is a small snippet from the full code
I am trying to understand the logical process of this methodology of split.
SHA1 encoding is 40 characters in hexadecimal. What kind of probability has been computed in the expression ?
What is the reason for (MAX_NUM_IMAGES_PER_CLASS + 1) ? Why add 1 ?
Does setting different values to MAX_NUM_IMAGES_PER_CLASS have an effect on the split quality ?
How good a quality of split would we get out of this ? Is this is a recommended way of splitting datasets ?
# We want to ignore anything after '_nohash_' in the file name when
# deciding which set to put an image in, the data set creator has a way of
# grouping photos that are close variations of each other. For example
# this is used in the plant disease data set to group multiple pictures of
# the same leaf.
hash_name = re.sub(r'_nohash_.*$', '', file_name)
# This looks a bit magical, but we need to decide whether this file should
# go into the training, testing, or validation sets, and we want to keep
# existing files in the same set even if more files are subsequently
# added.
# To do that, we need a stable way of deciding based on just the file name
# itself, so we do a hash of that and then use that to generate a
# probability value that we use to assign it.
hash_name_hashed = hashlib.sha1(compat.as_bytes(hash_name)).hexdigest()
percentage_hash = ((int(hash_name_hashed, 16) %
(MAX_NUM_IMAGES_PER_CLASS + 1)) *
(100.0 / MAX_NUM_IMAGES_PER_CLASS))
if percentage_hash < validation_percentage:
validation_images.append(base_name)
elif percentage_hash < (testing_percentage + validation_percentage):
testing_images.append(base_name)
else:
training_images.append(base_name)
result[label_name] = {
'dir': dir_name,
'training': training_images,
'testing': testing_images,
'validation': validation_images,
}

This code is simply distributing file names “randomly” (but reproducibly) over a number of bins and then grouping the bins into just the three categories. The number of bits in the hash is irrelevant (so long as it’s “enough”, which is probably about 35 for this sort of work).
Reducing modulo n+1 produces a value on [0,n], and multiplying that by 100/n obviously produces a value on [0,100], which is being interpreted as a percentage. n being MAX_NUM_IMAGES_PER_CLASS is meant to control the rounding error in the interpretation to be no more than “one image”.
This strategy is reasonable, but looks a bit more sophisticated than it is (since there is still rounding going on, and the remainder introduces a bias—although with numbers this large it is utterly unobservable). You could make it simpler and more accurate by simply precalculating ranges over the whole space of 2^160 hashes for each class and just checking the hash against the two boundaries. That still notionally involves rounding, but with 160 bits it’s only that intrinsic to representing decimals like 31% in floating point.

Related

How to fix negative values in log?

So, I am getting the data from a txt file and I want to get specific data within the whole set. In the code, I am trying to grab it by specifying which indexes and which frequencies are being used for those indexes. But my log is showing a negative value and I don't how to fix that. Code is below, thanks!
indexes = [9,10,11,12,13]
frequenciesmh = [151,610,1400,4860,18000]
frequenciesgh = [i*10**-3 for i in frequenciesmh]
bigclusterallfluxes = bigcluster[indexes]
bigclusterlogflux151mhandredshift = [i[indexes] for i in bigcluster]
shiftedlogflux151mh =
[np.interp(np.log10((151*10**-3)*i[0]),np.log10(frequenciesgh),i[1:])
for i in bigclusterlogflux151mhandredshift]
shiftflux151mh = [10**i for i in shiftedlogflux151mh]
bigclusterflux151mhandredshift =
np.array(list(zip(shiftflux151mh,np.transpose(bigcluster)[9])))

I don't know what you are trying to fix exactly, but I would definitely NOT change the negative values as they would change the power to being positive always (if you know some maths you will understand that that means 1/16 ==> 16 and also 16 ==> 16).
What you probably want, as you are working with frequencies (Which are always between 0 and 1, if you normalize them, to do this divide each of them by the sum of all of them, hence your logarithm will always be smaller or equal to 0) is to make them all positive and have the - log 10 of your probability, which is quite a common value to have, then 1 == 1/10, 2 == 1/100, etc (which in genetics at least are called phred values I believe).
Summarizing always call the minus log, not the log
-math.log(0.0001)

The abs() function is what you are looking for.

Detect significant changes in a data-set that gradually changes

I have a list of data in python that represents amount of resources used per minute. I want to find the number of times it changes significantly in that data set. What I mean by significant change is a bit different from what I've read so far.
For e.g. if I have a dataset like
[10,15,17,20,30,40,50,70,80,60,40,20]
I say a significant change happens when data increases by double or reduces by half with respect to the previous normal.
For e.g. since the list starts with 10, that is our starting normal point
Then when data doubles to 20, I count that as one significant change and set the normal to 20.
Then when data doubles to 40, it is considered a significant change and the normal is now 40
Then when data doubles to 80, it is considered a significant change and the normal is now 80
After that when data reduces by half to 40, it is considered as another significant change and the normal becomes 40
Finally when data reduces by half to 20, it is the last significant change
Here there are a total of 5 significant changes.
Is it similar to any other change detection algorithm? How can this be done efficiently in python?

This is relatively straightforward. You can do this with a single iteration through the list. We simply update our base when a 'significant' change occurs.
Note that my implementation will work for any iterable or container. This is useful if you want to, for example, read through a file without having to load it all into memory.
def gen_significant_changes(iterable, *, tol = 2):
iterable = iter(iterable) # this is necessary if it is container rather than generator.
# note that if the iterable is already a generator iter(iterable) returns itself.
base = next(iterable)
for x in iterable:
if x >= (base * tol) or x <= (base/tol):
yield x
base = x
my_list = [10,15,17,20,30,40,50,70,80,60,40,20]
print(list(gen_significant_changes(my_list)))

I can't help with the Python part, but in terms of math, the problem you're asking is fairly simple to solve using log base 2. A significant change occurs when the current value divided by a constant can be reached by raising 2 to a different power (as an integer) than the previous value. (The constant is needed since the first value in the array forms the basis of comparison.)
For each element at t, compute:
current = math.log(Array[t] /Array[0], 2)
previous = math.log(Array[t-1]/Array[0], 2)
if math.floor(current) <> math.floor(previous) a significant change has occurred
Using this method you do not need to keep track of a "normal point" at all, you just need the array. By removing the additional state variable we enable the array to be processed in any order, and we could give portions of the array to different threads if the dataset were very large. You wouldn't be able to do that with your current method.

Reversible hash function?

I need a reversible hash function (obviously the input will be much smaller in size than the output) that maps the input to the output in a random-looking way. Basically, I want a way to transform a number like "123" to a larger number like "9874362483910978", but not in a way that will preserve comparisons, so it must not be always true that, if x1 > x2, f(x1) > f(x2) (but neither must it be always false).
The use case for this is that I need to find a way to transform small numbers into larger, random-looking ones. They don't actually need to be random (in fact, they need to be deterministic, so the same input always maps to the same output), but they do need to look random (at least when base64encoded into strings, so shifting by Z bits won't work as similar numbers will have similar MSBs).
Also, easy (fast) calculation and reversal is a plus, but not required.
I don't know if I'm being clear, or if such an algorithm exists, but I'd appreciate any and all help!

None of the answers provided seemed particularly useful, given the question. I had the same problem, needing a simple, reversible hash for not-security purposes, and decided to go with bit relocation. It's simple, it's fast, and it doesn't require knowing anything about boolean maths or crypo algorithms or anything else that requires actual thinking.
The simplest would probably be to just move half the bits left, and the other half right:
def hash(n):
return ((0x0000FFFF & n)<<16) + ((0xFFFF0000 & n)>>16)
This is reversible, in that hash(hash(n)) = n, and has non-sequential pairs {n,m}, n < m, where hash(m) < hash(n).
And to get a much less sequential looking implementation, you might also want to consider an interlace reordering from [msb,z,...,a,lsb] to [msb,lsb,z,a,...] or [lsb,msb,a,z,...] or any other relocation you feel gives an appropriately non-sequential sequence for the numbers you deal with, or even add a XOR on top for peak desequential'ing.
(The above function is safe for numbers that fit in 32 bits, larger numbers are guaranteed to cause collisions and would need some more bit mask coverage to prevent problems. That said, 32 bits is usually enough for any non-security uid).
Also have a look at the multiplicative inverse answer given by Andy Hayden, below.

Another simple solution is to use multiplicative inverses (see Eri Clippert's blog):
we showed how you can take any two coprime positive integers x and m and compute a third positive integer y with the property that (x * y) % m == 1, and therefore that (x * z * y) % m == z % m for any positive integer z. That is, there always exists a “multiplicative inverse”, that “undoes” the results of multiplying by x modulo m.
We take a large number e.g. 4000000000 and a large co-prime number e.g. 387420489:
def rhash(n):
return n * 387420489 % 4000000000
>>> rhash(12)
649045868
We first calculate the multiplicative inverse with modinv which turns out to be 3513180409:
>>> 3513180409 * 387420489 % 4000000000
1
Now, we can define the inverse:
def un_rhash(h):
return h * 3513180409 % 4000000000
>>> un_rhash(649045868) # un_rhash(rhash(12))
12
Note: This answer is fast to compute and works for numbers up to 4000000000, if you need to handle larger numbers choose a sufficiently large number (and another co-prime).
You may want to do this with hexidecimal (to pack the int):
def rhash(n):
return "%08x" % (n * 387420489 % 4000000000)
>>> rhash(12)
'26afa76c'
def un_rhash(h):
return int(h, 16) * 3513180409 % 4000000000
>>> un_rhash('26afa76c') # un_rhash(rhash(12))
12
If you choose a relatively large co-prime then this will seem random, be non-sequential and also be quick to calculate.

What you are asking for is encryption. A block cipher in its basic mode of operation, ECB, reversibly maps a input block onto an output block of the same size. The input and output blocks can be interpreted as numbers.
For example, AES is a 128 bit block cipher, so it maps an input 128 bit number onto an output 128 bit number. If 128 bits is good enough for your purposes, then you can simply pad your input number out to 128 bits, transform that single block with AES, then format the output as a 128 bit number.
If 128 bits is too large, you could use a 64 bit block cipher, like 3DES, IDEA or Blowfish.
ECB mode is considered weak, but its weakness is the constraint that you have postulated as a requirement (namely, that the mapping be "deterministic"). This is a weakness, because once an attacker has observed that 123 maps to 9874362483910978, from then on whenever she sees the latter number, she knows the plaintext was 123. An attacker can perform frequency analysis and/or build up a dictionary of known plaintext/ciphertext pairs.

Basically, you are looking for 2 way encryption, and one that probably uses a salt.
You have a number of choices:
TripleDES
AES
Here is an example:" Simple insecure two-way "obfuscation" for C#
What language are you looking at? If .NET then look at the encryption namespace for some ideas.

Why not just XOR with a nice long number?
Easy. Fast. Reversible.
Or, if this doesn't need to be terribly secure, you could convert from base 10 to some smaller base (like base 8 or base 4, depending on how long you want the numbers to be).

python parallel computing: split keyspace to give each node a range to work on

My question is rather complicated for me to explain, as i'm not really good at maths, but i'll try to be as clear as possible.
I'm trying to code a cluster in python, which will generate words given a charset (i.e. with lowercase: aaaa, aaab, aaac, ..., zzzz) and make various operations on them.
I'm searching how to calculate, given the charset and the number of nodes, what range each node should work on (i.e.: node1: aaaa-azzz, node2: baaa-czzz, node3: daaa-ezzz, ...). Is it possible to make an algorithm that could compute this, and if it is, how could i implement this in python?
I really don't know how to do that, so any help would be much appreciated

Any way that you could compute a small integer from the string would be fine for clustering. For example, compute a hash with md5, and look at a byte of it:
import hashlib
s = "aaac"
num_nodes = 5 # or whatever
m = hashlib.md5(s)
node = ord(m.digest()[0]) % num_nodes
print node # prints 2
This won't guarantee to evenly distribute all the strings, but it will be close.

You should be able to treat your words as numerals in a strange base. For example, let's say you have a..z as your charset (26 characters), 4 character strings, and you want to distribute among equally 10 machines. Then there are a total of 26^4 strings, so each machine gets 26^4/10 strings. The first machine will get strings 0 through 26^4/10, the next 26^4/10 through 26^4/5, etc.
To convert the numbers to strings, just write the number in base 26 using your charset as the numbers. So 0 is 'aaaa' and 26^4/10 = 2*26^3 + 15*26^2 + 15*26 +15 is 'cppp'.

In what contexts do programming languages make real use of an Infinity value?

So in Ruby there is a trick to specify infinity:
1.0/0
=> Infinity
I believe in Python you can do something like this
float('inf')
These are just examples though, I'm sure most languages have infinity in some capacity. When would you actually use this construct in the real world? Why would using it in a range be better than just using a boolean expression? For instance
(0..1.0/0).include?(number) == (number >= 0) # True for all values of number
=> true
To summarize, what I'm looking for is a real world reason to use Infinity.
EDIT: I'm looking for real world code. It's all well and good to say this is when you "could" use it, when have people actually used it.

Dijkstra's Algorithm typically assigns infinity as the initial edge weights in a graph. This doesn't have to be "infinity", just some arbitrarily constant but in java I typically use Double.Infinity. I assume ruby could be used similarly.

Off the top of the head, it can be useful as an initial value when searching for a minimum value.
For example:
min = float('inf')
for x in somelist:
if x<min:
min=x
Which I prefer to setting min initially to the first value of somelist
Of course, in Python, you should just use the min() built-in function in most cases.

There seems to be an implied "Why does this functionality even exist?" in your question. And the reason is that Ruby and Python are just giving access to the full range of values that one can specify in floating point form as specified by IEEE.
This page seems to describe it well:
http://steve.hollasch.net/cgindex/coding/ieeefloat.html
As a result, you can also have NaN (Not-a-number) values and -0.0, while you may not immediately have real-world uses for those either.

In some physics calculations you can normalize irregularities (ie, infinite numbers) of the same order with each other, canceling them both and allowing a approximate result to come through.
When you deal with limits, calculations like (infinity / infinity) -> approaching a finite a number could be achieved. It's useful for the language to have the ability to overwrite the regular divide-by-zero error.

Use Infinity and -Infinity when implementing a mathematical algorithm calls for it.
In Ruby, Infinity and -Infinity have nice comparative properties so that -Infinity < x < Infinity for any real number x. For example, Math.log(0) returns -Infinity, extending to 0 the property that x > y implies that Math.log(x) > Math.log(y). Also, Infinity * x is Infinity if x > 0, -Infinity if x < 0, and 'NaN' (not a number; that is, undefined) if x is 0.
For example, I use the following bit of code in part of the calculation of some log likelihood ratios. I explicitly reference -Infinity to define a value even if k is 0 or n AND x is 0 or 1.
Infinity = 1.0/0.0
def Similarity.log_l(k, n, x)
unless x == 0 or x == 1
k * Math.log(x.to_f) + (n-k) * Math.log(1.0-x)
end
-Infinity
end
end

Alpha-beta pruning

I use it to specify the mass and inertia of a static object in physics simulations. Static objects are essentially unaffected by gravity and other simulation forces.

In Ruby infinity can be used to implement lazy lists. Say i want N numbers starting at 200 which get successively larger by 100 units each time:
Inf = 1.0 / 0.0
(200..Inf).step(100).take(N)
More info here: http://banisterfiend.wordpress.com/2009/10/02/wtf-infinite-ranges-in-ruby/

I've used it for cases where you want to define ranges of preferences / allowed.
For example in 37signals apps you have like a limit to project number
Infinity = 1 / 0.0
FREE = 0..1
BASIC = 0..5
PREMIUM = 0..Infinity
then you can do checks like
if PREMIUM.include? current_user.projects.count
# do something
end

I used it for representing camera focus distance and to my surprise in Python:
>>> float("inf") is float("inf")
False
>>> float("inf") == float("inf")
True
I wonder why is that.

I've used it in the minimax algorithm. When I'm generating new moves, if the min player wins on that node then the value of the node is -∞. Conversely, if the max player wins then the value of that node is +∞.
Also, if you're generating nodes/game states and then trying out several heuristics you can set all the node values to -∞/+∞ which ever makes sense and then when you're running a heuristic its easy to set the node value:
node_val = -∞
node_val = max(heuristic1(node), node_val)
node_val = max(heuristic2(node), node_val)
node_val = max(heuristic2(node), node_val)

I've used it in a DSL similar to Rails' has_one and has_many:
has 0..1 :author
has 0..INFINITY :tags
This makes it easy to express concepts like Kleene star and plus in your DSL.

I use it when I have a Range object where one or both ends need to be open

I've used symbolic values for positive and negative infinity in dealing with range comparisons to eliminate corner cases that would otherwise require special handling:
Given two ranges A=[a,b) and C=[c,d) do they intersect, is one greater than the other, or does one contain the other?
A > C iff a >= d
A < C iff b <= c
etc...
If you have values for positive and negative infinity that respectively compare greater than and less than all other values, you don't need to do any special handling for open-ended ranges. Since floats and doubles already implement these values, you might as well use them instead of trying to find the largest/smallest values on your platform. With integers, it's more difficult to use "infinity" since it's not supported by hardware.

I ran across this because I'm looking for an "infinite" value to set for a maximum, if a given value doesn't exist, in an attempt to create a binary tree. (Because I'm selecting based on a range of values, and not just a single value, I quickly realized that even a hash won't work in my situation.)
Since I expect all numbers involved to be positive, the minimum is easy: 0. Since I don't know what to expect for a maximum, though, I would like the upper bound to be Infinity of some sort. This way, I won't have to figure out what "maximum" I should compare things to.
Since this is a project I'm working on at work, it's technically a "Real world problem". It may be kindof rare, but like a lot of abstractions, it's convenient when you need it!
Also, to those who say that this (and other examples) are contrived, I would point out that all abstractions are somewhat contrived; that doesn't mean they are useful when you contrive them.

When working in a problem domain where trig is used (especially tangent) infinity is an answer that can come up. Trig ends up being used heavily in graphics applications, games, and geospatial applications, plus the obvious math applications.

I'm sure there are other ways to do this, but you could use Infinity to check for reasonable inputs in a String-to-Float conversion. In Java, at least, the Float.isNaN() static method will return false for numbers with infinite magnitude, indicating they are valid numbers, even though your program might want to classify them as invalid. Checking against the Float.POSITIVE_INFINITY and Float.NEGATIVE_INFINITY constants solves that problem. For example:
// Some sample values to test our code with
String stringValues[] = {
"-999999999999999999999999999999999999999999999",
"12345",
"999999999999999999999999999999999999999999999"
};
// Loop through each string representation
for (String stringValue : stringValues) {
// Convert the string representation to a Float representation
Float floatValue = Float.parseFloat(stringValue);
System.out.println("String representation: " + stringValue);
System.out.println("Result of isNaN: " + floatValue.isNaN());
// Check the result for positive infinity, negative infinity, and
// "normal" float numbers (within the defined range for Float values).
if (floatValue == Float.POSITIVE_INFINITY) {
System.out.println("That number is too big.");
} else if (floatValue == Float.NEGATIVE_INFINITY) {
System.out.println("That number is too small.");
} else {
System.out.println("That number is jussssst right.");
}
}
Sample Output:
String representation: -999999999999999999999999999999999999999999999
Result of isNaN: false
That number is too small.
String representation: 12345
Result of isNaN: false
That number is jussssst right.
String representation: 999999999999999999999999999999999999999999999
Result of isNaN: false
That number is too big.

It is used quite extensively in graphics. For example, any pixel in a 3D image that is not part of an actual object is marked as infinitely far away. So that it can later be replaced with a background image.

I'm using a network library where you can specify the maximum number of reconnection attempts. Since I want mine to reconnect forever:
my_connection = ConnectionLibrary(max_connection_attempts = float('inf'))
In my opinion, it's more clear than the typical "set to -1 to retry forever" style, since it's literally saying "retry until the number of connection attempts is greater than infinity".

Some programmers use Infinity or NaNs to show a variable has never been initialized or assigned in the program.

If you want the largest number from an input but they might use very large negatives. If I enter -13543124321.431 it still works out as the largest number since it's bigger than -inf.
enter code here
initial_value = float('-inf')
while True:
try:
x = input('gimmee a number or type the word, stop ')
except KeyboardInterrupt:
print("we done - by yo command")
break
if x == "stop":
print("we done")
break
try:
x = float(x)
except ValueError:
print('not a number')
continue
if x > initial_value: initial_value = x
print("The largest number is: " + str(initial_value))

You can to use:
import decimal
decimal.Decimal("Infinity")
or:
from decimal import *
Decimal("Infinity")

For sorting
I've seen it used as a sort value, to say "always sort these items to the bottom".

To specify a non-existent maximum
If you're dealing with numbers, nil represents an unknown quantity, and should be preferred to 0 for that case. Similarly, Infinity represents an unbounded quantity, and should be preferred to (arbitrarily_large_number) in that case.
I think it can make the code cleaner. For example, I'm using Float::INFINITY in a Ruby gem for exactly that: the user can specify a maximum string length for a message, or they can specify :all. In that case, I represent the maximum length as Float::INFINITY, so that later when I check "is this message longer than the maximum length?" the answer will always be false, without needing a special case.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

SHA Hashing for training/validation/testing set split - python

Related

How to fix negative values in log?

Detect significant changes in a data-set that gradually changes

Reversible hash function?

python parallel computing: split keyspace to give each node a range to work on

In what contexts do programming languages make real use of an Infinity value?

Categories

Resources