Normalize with respect to row and column - python

I have an array of probabilities. I would like the columns to sum to 1 (representing probability) and the rows to sum to X (where X is an integer, say 9 for example).
I thought I could normalize the columns, then normalize the rows and multiply by X. But this didn't work: the resulting row and column sums were not exactly X and 1.0.
This is what I tried:
# B is 5 rows by 30 columns
# Normalizing columns to 1.0
col_sum = []
for col in B.T:
    col_sum.append(sum(col))
for row in range(B.shape[0]):
    for col in range(B.shape[1]):
        if B[row][col] != 0.0 and B[row][col] != 1.0:
            B[row][col] = (B[row][col] / col_sum[col])
# Normalizing rows to X (9.0)
row_sum = []
for row in B:
    row_sum.append(sum(row))
for row in range(B.shape[0]):
    for col in range(B.shape[1]):
        if B[row][col] != 0.0 and B[row][col] != 1.0:
            B[row][col] = (B[row][col] / row_sum[row]) * 9.0

I'm not sure I understood correctly, but what you're trying to accomplish may not be mathematically feasible.
Imagine a 2x2 matrix where you want the rows to sum to 1 and the columns to sum to 10. Even if every entry were 1 (the maximum possible value), each column would only sum to 2, so the columns could never reach 10.

This can only work if your matrix's number of columns is X times the number of rows. For example, if X = 3 and you have 5 rows, then you must have 15 columns. So, you could make your 5x30 matrix work for X=6 but not X=9.
The reason for this is that, if each column sums up to 1.0, the total of all values in the matrix will be 1.0 times the number of columns. And since you want each row to sum up to X, then the total of all values must also be X times the number of rows.
So: Columns * 1.0 = X * Rows
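The constraint above can be checked before attempting any normalization. A minimal sketch, where the function name `is_feasible` is made up for illustration:

```python
def is_feasible(n_rows, n_cols, x):
    """Return True if columns can sum to 1.0 while rows sum to x.

    Both conditions fix the grand total of the matrix:
    n_cols * 1.0 (from the columns) must equal x * n_rows (from the rows).
    """
    return n_cols * 1.0 == x * n_rows


print(is_feasible(5, 30, 6))  # True:  30 == 6 * 5
print(is_feasible(5, 30, 9))  # False: 30 != 9 * 5
print(is_feasible(5, 15, 3))  # True:  15 == 3 * 5
```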
If that constraint is met and the initial values are already balanced, you only have to scale each value by X/sum(row), and both dimensions will work out automatically. If the matrix is not already balanced, adjusting the values would be similar to solving a sudoku (allegedly an NP problem), and the result would be largely unrelated to the initial values. The matrix is balanced when scaling all rows to the same sum also gives all columns the same sum.
For example, with X = 3 (the trailing number on each line is the row sum; the last line shows the column sums):
[0.7, 2.1, 1.4, 0.7, 1.4, 1.4, 0.7, 1.4, 1.4, 2.1, 0.7, 2.1, 1.4, 2.1, 1.4] 21
[2.8, 1.4, 0.7, 2.1, 1.4, 2.1, 0.7, 1.4, 2.1, 1.4, 0.7, 0.7, 1.4, 0.7, 1.4] 21
[1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 1.4, 0.7, 2.8, 0.7, 0.7, 1.4, 2.1] 21
[1.4, 1.4, 1.4, 1.4, 2.1, 1.4, 1.4, 1.4, 0.7, 0.7, 2.1, 1.4, 1.4, 1.4, 1.4] 21
[0.7, 0.7, 2.1, 1.4, 0.7, 0.7, 2.8, 1.4, 1.4, 2.1, 0.7, 2.1, 2.1, 1.4, 0.7] 21
apply x = x * 3 / 21 to all elements ...
[0.1, 0.3, 0.2, 0.1, 0.2, 0.2, 0.1, 0.2, 0.2, 0.3, 0.1, 0.3, 0.2, 0.3, 0.2] 3.0
[0.4, 0.2, 0.1, 0.3, 0.2, 0.3, 0.1, 0.2, 0.3, 0.2, 0.1, 0.1, 0.2, 0.1, 0.2] 3.0
[0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.1, 0.4, 0.1, 0.1, 0.2, 0.3] 3.0
[0.2, 0.2, 0.2, 0.2, 0.3, 0.2, 0.2, 0.2, 0.1, 0.1, 0.3, 0.2, 0.2, 0.2, 0.2] 3.0
[0.1, 0.1, 0.3, 0.2, 0.1, 0.1, 0.4, 0.2, 0.2, 0.3, 0.1, 0.3, 0.3, 0.2, 0.1] 3.0
[1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
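The scaling step in the example above can be sketched in plain Python. This assumes the matrix is already balanced; the small 2x4 matrix and the name `scale_rows` are made up for illustration and are not taken from the question's data:

```python
def scale_rows(matrix, x):
    """Scale every row so that it sums to x."""
    return [[value * x / sum(row) for value in row] for row in matrix]


# Hypothetical balanced matrix: 2 rows, 4 columns, so X must be 4 / 2 = 2.
balanced = [[1.0, 3.0, 2.0, 2.0],
            [3.0, 1.0, 2.0, 2.0]]

scaled = scale_rows(balanced, 2)
row_sums = [sum(row) for row in scaled]
col_sums = [sum(col) for col in zip(*scaled)]

print(scaled)    # [[0.25, 0.75, 0.5, 0.5], [0.75, 0.25, 0.5, 0.5]]
print(row_sums)  # [2.0, 2.0]
print(col_sums)  # [1.0, 1.0, 1.0, 1.0]
```

With real data the sums will usually be off by a tiny floating-point error, so it is safer to compare them with a tolerance than with ==.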

Related

Pandas Group By column to generate quantiles (.25, 0.5, .75)

Let's say we have CityName, Min-Temperature, Max-Temperature, Humidity of different cities.
We need an output dataframe grouped by CityName that contains the 0.25, 0.5 and 0.75 quantiles of each column. The new column names would be OldColumnName + 'Q1'/'Q2'/'Q3'.
Example INPUT
import pandas as pd
df = pd.DataFrame({'cityName': pd.Categorical(['a','a','a','a','b','b','b','b','a','a','a','a','b','b','b','b']),
                   'MinTemp': [1.1, 2.1, 3.1, 1.1, 2, 2.1, 2.2, 2.4, 2.5, 1.11, 1.31, 2.1, 1, 2, 2.3, 2.1],
                   'MaxTemp': [2.1, 4.2, 5.1, 2.13, 4, 3.1, 5.2, 3.4, 3.5, 2.11, 2.31, 3.1, 2, 4.3, 4.3, 3.1],
                   'Humidity': [0.29, 0.19, .45, 0.1, 0.1, 0.1, 0.2, 0.5, 0.11, 0.31, 0.1, .1, .2, 0.3, 0.3, 0.1]
                   })
OUTPUT
First Approach
First, group your data on the column you want, which is 'cityName'. Then, because you want several different aggregations on each column, you can use the 'agg' function. In a list of functions given to 'agg' you cannot supply per-function parameters, so define one small function per quantile as follows:
def quantile_50(x):
    return x.quantile(0.5)
def quantile_25(x):
    return x.quantile(0.25)
def quantile_75(x):
    return x.quantile(0.75)
quantile_df = df.groupby('cityName').agg([quantile_25, quantile_50, quantile_75])
quantile_df
Second Approach
You can use the describe method and select the statistics you need. pd.IndexSlice (idx below) lets you pick the sub-columns by label.
idx = pd.IndexSlice
df.groupby('cityName').describe().loc[:, idx[:, ['25%', '50%', '75%']]]
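Neither approach above produces the OldColumnName + Q1/Q2/Q3 names the question asks for. One possible way to get them, sketched here on a small made-up sample (the data values and the `labels` mapping are assumptions for illustration), is to pass a list of quantiles to the groupby and flatten the resulting column index:

```python
import pandas as pd

# Small made-up sample; the column names follow the question.
df = pd.DataFrame({
    "cityName": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "MinTemp": [1.0, 2.0, 3.0, 4.0, 10.0, 20.0, 30.0, 40.0],
    "MaxTemp": [2.0, 4.0, 6.0, 8.0, 20.0, 40.0, 60.0, 80.0],
})

# One row per (city, quantile), then move the quantile level into the columns.
wide = df.groupby("cityName").quantile([0.25, 0.5, 0.75]).unstack()

# Flatten the (column, quantile) pairs into names such as "MinTempQ1".
labels = {0.25: "Q1", 0.5: "Q2", 0.75: "Q3"}
wide.columns = [f"{col}{labels[q]}" for col, q in wide.columns]

print(wide)
```

The result has one row per city and one column per original column and quantile.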

How can a tensor in TensorFlow be sliced using elements of another array as an index?

I'm looking for a similar function to tf.unsorted_segment_sum, but I don't want to sum the segments, I want to get every segment as a tensor.
So for example, I have this code:
(In reality, my tensor has shape (10000, 63), and the number of segments would be 2500.)
to_be_sliced = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5],
                            [0.3, 0.2, 0.2, 0.6, 0.3],
                            [0.9, 0.8, 0.7, 0.6, 0.5],
                            [2.0, 2.0, 2.0, 2.0, 2.0]])
indices = tf.constant([0, 2, 0, 1])
num_segments = 3
tf.unsorted_segment_sum(to_be_sliced, indices, num_segments)
The output here would be:
array([row1 + row3, row4, row2])
What I am looking for is 3 tensors with different shapes (maybe a list of tensors): the first containing the first and third rows of the original (shape (2, 5)), the second containing the 4th row (shape (1, 5)), and the third containing the second row, like this:
[array([[0.1, 0.2, 0.3, 0.4, 0.5],
        [0.9, 0.8, 0.7, 0.6, 0.5]]),
 array([[2.0, 2.0, 2.0, 2.0, 2.0]]),
 array([[0.3, 0.2, 0.2, 0.6, 0.3]])]
Thanks in advance!
You can do that like this:
import tensorflow as tf
to_be_sliced = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5],
                            [0.3, 0.2, 0.2, 0.6, 0.3],
                            [0.9, 0.8, 0.7, 0.6, 0.5],
                            [2.0, 2.0, 2.0, 2.0, 2.0]])
indices = tf.constant([0, 2, 0, 1])
num_segments = 3
# One boolean mask per segment id selects the rows that belong to that segment
result = [tf.boolean_mask(to_be_sliced, tf.equal(indices, i)) for i in range(num_segments)]
# tf.Session is the TensorFlow 1.x way to evaluate the graph
with tf.Session() as sess:
    print(*sess.run(result), sep='\n')
Output:
[[0.1 0.2 0.3 0.4 0.5]
 [0.9 0.8 0.7 0.6 0.5]]
[[2. 2. 2. 2. 2.]]
[[0.3 0.2 0.2 0.6 0.3]]
For your case, you can use NumPy-style slicing in TensorFlow. So this will work:
sliced_1 = to_be_sliced[:3, :]
# [[0.1 0.2 0.3 0.4 0.5]
#  [0.3 0.2 0.2 0.6 0.3]
#  [0.9 0.8 0.7 0.6 0.5]]
sliced_2 = to_be_sliced[3, :]
# [2. 2. 2. 2. 2.]
Or, as a more general option, you can pick the rows explicitly with tf.gather_nd:
to_be_sliced = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5],
                            [0.3, 0.2, 0.2, 0.6, 0.3],
                            [0.9, 0.8, 0.7, 0.6, 0.5],
                            [2.0, 2.0, 2.0, 2.0, 2.0]])
first_tensor = tf.gather_nd(to_be_sliced, [[0], [2]])
second_tensor = tf.gather_nd(to_be_sliced, [[3]])
third_tensor = tf.gather_nd(to_be_sliced, [[1]])
concat = tf.concat([first_tensor, second_tensor, third_tensor], axis=0)

Combining rows of the same key into single array

I have a pandas dataframe as follows:
error
0: [[0.1,0.4,-0.3]]
1: [[-0.6,-0.3,0.2]]
.
.
.
99: [[0.4,-0.7,0.1]]
I would like to combine all values into a single array like this:
[0.1,0.4,-0.3,-0.6,-0.3,0.2,...,0.4,-0.7,0.1]
Is there a fast way to do this using pandas, or do I need to iterate over the data and build the array "manually"?
The data order, in this case, is not important.
More generally, how can I combine arrays that don't have the same size (e.g. row 0 contains an array of 3 elements, row 1 contains an array of 6 elements, etc.)?
Use numpy.ravel:
L = np.array(df['error'].values.tolist()).ravel().tolist()
print (L)
[0.1, 0.4, -0.3, -0.6, -0.3, 0.2, 0.4, -0.7, 0.1]
More general solutions, using str[0] to select the nested lists:
print (df)
error
0 [[0.1,0.4,-0.3]]
1 [[-0.6,-0.3]]
99 [[0.4,-0.7,0.1]]
from itertools import chain
L = list(chain.from_iterable(df['error'].str[0]))
print (L)
[0.1, 0.4, -0.3, -0.6, -0.3, 0.4, -0.7, 0.1]
L = np.concatenate(df['error'].str[0].values).tolist()
print (L)
[0.1, 0.4, -0.3, -0.6, -0.3, 0.4, -0.7, 0.1]
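The snippets above assume that df already exists. A self-contained sketch of the ragged case follows; the way the dataframe is built here is an assumption, chosen to mirror the printed rows:

```python
from itertools import chain

import pandas as pd

# Made-up rows of different lengths, each wrapped in an outer list
# as in the question.
df = pd.DataFrame({"error": [[[0.1, 0.4, -0.3]],
                             [[-0.6, -0.3]],
                             [[0.4, -0.7, 0.1]]]},
                  index=[0, 1, 99])

# .str[0] unwraps the outer list of each cell; chain joins the inner lists.
flat = list(chain.from_iterable(df["error"].str[0]))
print(flat)  # [0.1, 0.4, -0.3, -0.6, -0.3, 0.4, -0.7, 0.1]
```

Because chain only iterates, it does not care whether the inner lists have the same length.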
For a dataframe of plain numeric columns, you can flatten the values directly:
df=pd.DataFrame([[0.1,0.4,-0.3],[-0.6,-0.3,0.2]])
df.values.flatten()
returns:
array([ 0.1, 0.4, -0.3, -0.6, -0.3, 0.2])
If you would like to flatten column by column instead:
df.values.flatten(order='F')
returns:
array([ 0.1, -0.6, 0.4, -0.3, -0.3, 0.2])

Iterating forward and backward in Python

I have a coding interface which has a counter component. It simply increments by 1 with every update. Consider it an infinite generator of {1,2,3,...} over time which I HAVE TO use.
I need to use this value and iterate from -1.5 to 1.5. So, the iteration should start from -1.5 and reach 1.5 and then from 1.5 back to -1.5.
How should I use this infinite iterator to generate an iteration in that range?
You can use cycle from itertools to repeat a sequence.
from itertools import cycle
# build the list with 0.1 increment
v = [(x-15)/10 for x in range(31)]
v = v + list(reversed(v))
cv = cycle(v)
for c in my_counter:
    x = next(cv)
This will repeat the list v:
-1.5, -1.4, -1.3, -1.2, -1.1, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4,
-0.3, -0.2, -0.1, 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0,
1.1, 1.2, 1.3, 1.4, 1.5, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9, 0.8, 0.7,
0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0, -0.1, -0.2, -0.3, -0.4, -0.5, -0.6,
-0.7, -0.8, -0.9, -1.0, -1.1, -1.2, -1.3, -1.4, -1.5, -1.5, -1.4, -1.3,
-1.2, -1.1, -1.0, -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1,
0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3,
1.4, 1.5, 1.5, 1.4, 1.3, 1.2, 1.1, 1.0, 0.9 ...
Something like:
import itertools
infGenGiven = itertools.count()  # This is similar to your generator
def func(x):
    if x % 2 == 0:
        return 1.5
    else:
        return -1.5
infGenCycle = map(func, infGenGiven)  # map is lazy in Python 3
count = 0
while count < 10:
    print(next(infGenCycle))
    count += 1
Output:
1.5
-1.5
1.5
-1.5
1.5
-1.5
1.5
-1.5
1.5
-1.5
Note that this starts at 1.5 because the first value of infGenGiven is 0. Your generator starts at 1, so with it the infGenCycle output will start at -1.5, which is what you want.
Thank you all.
I guess the best approach is to use the trigonometric functions (sine or cosine) which oscillate between plus and minus one.
More details at: https://en.wikipedia.org/wiki/Trigonometric_functions
Cheers
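A minimal sketch of the trigonometric idea, plus a linear (triangle wave) variant for comparison. The function names, the period of 60 updates and the use of itertools.count as a stand-in for the interface's counter are all assumptions for illustration:

```python
import math
from itertools import count, islice


def cosine_sweep(n, amplitude=1.5, period=60):
    """Smooth sweep: -amplitude at n = 0, +amplitude at n = period / 2."""
    return -amplitude * math.cos(2 * math.pi * n / period)


def triangle_sweep(n, amplitude=1.5, half_period=30):
    """Linear sweep: rises from -amplitude to +amplitude, then falls back."""
    phase = n % (2 * half_period)
    if phase > half_period:
        phase = 2 * half_period - phase
    return -amplitude + 2 * amplitude * phase / half_period


counter = count(1)  # stand-in for the interface's counter, which starts at 1
for n in islice(counter, 5):
    # Subtract 1 so that the first update maps to -1.5.
    print(round(cosine_sweep(n - 1), 3), round(triangle_sweep(n - 1), 3))
```

The cosine version slows down near the ends of the range, while the triangle version moves in equal steps of 0.1 and does not repeat the endpoints.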

Get information out of sub-lists in main list elegantly

Ok, so here's my issue. I have a list composed of N sub-lists composed of M elements (floats) each. So in a general form it looks like this:
a_list = [b_list_1, b_list_2, ..., b_list_N]
with:
b_list_i = [c_float_1, c_float_2, ..., c_float_M]
For this example assume N=9 ; M=3, so the list looks like this:
a = [[1.1, 0.5, 0.7], [0.3, 1.4, 0.2], [0.6, 0.2, 1.], [1.1, 0.5, 0.3], [0.2, 1.1, 0.8], [1.1, 0.5, 1.], [1.2, 0.3, 0.6], [0.6, 0.4, 0.9], [0.6, 0.2, 0.5]]
I need to loop through this list identifying those items that share the same first two floats as the same item where the third float should be averaged before storing. This means I should check if an item was already identified as being repeated previously, so I do not identify it again as a new item.
To give a more clear idea of what I mean, this is what the output of processing list a should look like:
a_processed = [[1.1, 0.5, 0.67], [0.3, 1.4, 0.2], [0.6, 0.2, 0.75], [0.2, 1.1, 0.8], [1.2, 0.3, 0.6], [0.6, 0.4, 0.9]]
Note that the first item in this new list was identified three times in a (a[0], a[3] and a[5]) and so it was stored with its third float averaged ((0.7+0.3+1.)/3. = 0.67). The second item was not repeated in a so it was stored as is. The third item was found twice in a (a[2] and a[8]) and stored with its third float averaged ((1.+0.5)/2.=0.75). The rest of the items in the new list were not found as repeated in a so they were also stored with no modifications.
Since I know updating/modifying a list while looping through it is not recommended, I opted to use several temporary lists. This is the code I came up with:
import numpy as np
a = [[1.1, 0.5, 0.7], [0.3, 1.4, 0.2], [0.6, 0.2, 1.], [1.1, 0.5, 0.3],
     [0.2, 1.1, 0.8], [1.1, 0.5, 1.], [1.2, 0.3, 0.6], [0.6, 0.4, 0.9],
     [0.6, 0.2, 0.5]]
# Final list.
a_processed = []
# Holds indexes of elements to skip.
skip_elem = []
# Loop through all items in a.
for indx, elem in enumerate(a):
    temp_average = []
    temp_average.append(elem)
    # Only process if not found previously.
    if indx not in skip_elem:
        for indx2, elem2 in enumerate(a[(indx+1):]):
            if elem[0] == elem2[0] and elem[1] == elem2[1]:
                temp_average.append(elem2)
                skip_elem.append(indx2+indx+1)
        # Store 1st and 2nd floats and averaged 3rd float.
        a_processed.append([temp_average[0][0], temp_average[0][1],
                            round(np.mean([i[2] for i in temp_average]),2)])
This code works, but I'm wondering if there might be a more elegant/pythonic way of doing this. It just looks too convoluted (Fortran-esque I'd say) as is.
I think you can certainly make your code more concise and easier to read by using defaultdict to create a dictionary from the first two elements in each sublist to all the third items:
from collections import defaultdict
nums = defaultdict(list)
for arr in a:
    key = tuple(arr[:2]) # make the first two floats the key
    nums[key].append( arr[2] ) # append the third float for the given key
a_processed = [[k[0], k[1], sum(vals)/len(vals)] for k, vals in nums.items()]
Using this, I get the same output as you (albeit in a different order):
[[0.2, 1.1, 0.8], [1.2, 0.3, 0.6], [0.3, 1.4, 0.2], [0.6, 0.4, 0.9], [1.1, 0.5, 0.6666666666666666], [0.6, 0.2, 0.75]]
If the order of a_processed is an issue, you can use an OrderedDict, as pointed out by @DSM.
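A minimal sketch of the order-preserving variant. The rounding to two decimals is added here to match the output shown in the question, and the variable name `groups` is made up. On Python 3.7 and later a plain dict keeps insertion order as well, so OrderedDict mainly matters on older versions:

```python
from collections import OrderedDict

a = [[1.1, 0.5, 0.7], [0.3, 1.4, 0.2], [0.6, 0.2, 1.], [1.1, 0.5, 0.3],
     [0.2, 1.1, 0.8], [1.1, 0.5, 1.], [1.2, 0.3, 0.6], [0.6, 0.4, 0.9],
     [0.6, 0.2, 0.5]]

# Keys are remembered in the order in which they are first seen.
groups = OrderedDict()
for x, y, z in a:
    groups.setdefault((x, y), []).append(z)

a_processed = [[x, y, round(sum(vals) / len(vals), 2)]
               for (x, y), vals in groups.items()]
print(a_processed)
```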
For comparison, here's the pandas approach. If this is really a data processing problem behind the scenes, then you can save yourself a lot of time that way.
>>> a
[[1.1, 0.5, 0.7], [0.3, 1.4, 0.2], [0.6, 0.2, 1.0], [1.1, 0.5, 0.3], [0.2, 1.1, 0.8], [1.1, 0.5, 1.0], [1.2, 0.3, 0.6], [0.6, 0.4, 0.9], [0.6, 0.2, 0.5]]
>>> df = pd.DataFrame(a)
>>> df.groupby([0,1]).mean()
                2
0   1
0.2 1.1  0.800000
0.3 1.4  0.200000
0.6 0.2  0.750000
    0.4  0.900000
1.1 0.5  0.666667
1.2 0.3  0.600000
This problem is common enough that it's a one-liner. You can use named columns, compute a host of other useful statistics, handle missing data, etc.
