Integer matrix to stochastic matrix normalization - python

Suppose I have a matrix with integer values. I want to make it a stochastic matrix (i.e. the sum of each row in the matrix equals 1).
I create a random matrix, compute the sum of each row, and divide each element in a row by that row's sum.
import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0, 10000, size=10000).reshape(100, 100))
dt['sum_row'] = dt.sum(axis=1)
for col_n in dt.columns[:-1]:
    dt[col_n] = dt[col_n] / dt['sum_row']
After this, the sum of each normalized row (which I store in a sum_row_normalized column) should be equal to 1. But it is not:
(dt.sum_row_normalized == 1).value_counts()
> False 75
> True 25
> Name: sum_row_normalized, dtype: int64
I understand that some values are not exactly 1 but very close to it. Nevertheless, how can I normalize the matrix correctly?

You can't guarantee the floats will be exactly one, but you can check that they are close to one, to an arbitrary precision, with np.around.
This is probably easier/faster without looping through pandas columns.
X = np.random.randint(0,10000,size=10000).reshape(100,100)
X_float = X.astype(float)
Y = X_float/X_float.sum(axis=1)[:,np.newaxis]
sum(np.around(Y.sum(axis=1),decimals=10)==1) # is 100
(you don't need the .astype(float) step in python 3.x)
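As an aside (my addition, not from the answer above), pandas' DataFrame.div can do the same normalization without the column loop, and np.allclose is a convenient way to assert that the row sums are numerically 1. A small sketch:
import numpy as np
import pandas as pd

dt = pd.DataFrame(np.random.randint(0, 10000, size=(100, 100)))
# divide every row by its own sum; axis=0 aligns the row sums with the index
dt_norm = dt.div(dt.sum(axis=1), axis=0)
# compare the row sums up to floating-point tolerance rather than exact equality
print(np.allclose(dt_norm.sum(axis=1), 1.0))  # True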

Related

Pandas random n samples of consecutive rows / pairs

I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total) where the 2 rows in each pair are consecutive. I want to perform additional calculations on these 2 consecutive/paired rows.
e.g. if I say sample size n=2, I want to randomly pick pairs like the following and find the difference in distance within each pick:
Additional condition: value difference can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following pick, and so on:
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), then got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (even, odd) pairs (which decreases randomness a bit), you can select N rows at even positions and take the next (odd) row for each:
N = 20000
# get the indices of N random rows at even positions
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)
# select those OR the next ones
df_sample = df.loc[m|m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
increasing randomness
The drawback of the above approach is that there is a bias: the pairs always start at even positions. To overcome this we can first remove a random fraction of the DataFrame, small enough to still leave plenty of rows to choose from, but large enough to shift the (even, odd) alignment to (odd, even) in many locations. The fraction of rows to remove should be tuned depending on the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2
idx = (df
       .drop(df.sample(frac=frac).index)
       .loc[::2].sample(n=N)
       .index
       )
m = df.index.to_series().isin(idx)
df_sample = df.loc[m|m.shift()]
# check:
# len(df_sample)
# 40000
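Note that neither variant above enforces the question's value-difference limit. A rough post-filter sketch (my addition, assuming df_sample holds the sampled rows in consecutive pairs, as produced above) drops the pairs whose gap exceeds 4000; you would then re-sample to make up the shortfall:
import numpy as np

vals = df_sample["value"].to_numpy()
# rows come in consecutive pairs, so compare each first row with its partner
keep_pair = np.abs(vals[1::2] - vals[0::2]) <= 4000
# repeat each pair's verdict twice so it applies to both rows of the pair
df_sample = df_sample[np.repeat(keep_pair, 2)]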
Here's my first attempt (I only just noticed your additional constraint, and I'm not sure if you need the precise number of samples, in which case you'll have to do some fudging after the line first_indices = first_indices[mask] below).
import random
import numpy as np
# Temporarily reset the index so we have something we can add one to.
df = df.reset_index(level=0)
# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1,2) and (2,3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))  # 4 pairs for this toy example
# Filter out those indices where the value difference with the next row exceeds 4000.
mask = np.array([abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000
                 for i in first_indices])
first_indices = first_indices[mask]
# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1
# Filter
df_sample = df[df.index.isin(c)]
# Restore original index if required.
df = df.set_index("index")
Hope that helps. Regarding the bit where I use a mask to filter the indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
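As a side note (my addition, not part of the answer above), the list-comprehension mask can also be built in a vectorized way, assuming value is a numeric column and first_indices holds positions into the reset DataFrame:
values = df["value"].to_numpy()
# keep only the pairs whose value gap stays within the 4000 limit
mask = np.abs(values[first_indices + 1] - values[first_indices]) <= 4000
first_indices = first_indices[mask]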

Masking few non-zero elements of certain rows of a matrix

I have a 3*3 matrix with 1s and 0s, A = [[1,0,1],[0,1,1],[1,0,0]], and an array indicating the limit on each row sum, B = [1,2,1]. I want to find the rows of A whose sum exceeds the corresponding value in B, and set enough non-zero elements of those rows to zero so that the sum matches B. Finding the rows of A that exceed the sum is easy; masking the elements to adjust the sum is what I need help with. How can this be achieved (I want to scale it to larger matrices and tensors)?
I would do something like this:
import numpy as np
A = np.array([[1,0,1],[0,1,1],[1,0,0]])
B = np.array([1,2,1])
# a cumulative sum of each row will tell you how many
# ones were in that row up to each point.
A_cs = np.cumsum(A, axis = 1)
# thresholding according to the sum vector will let
# you know where you should start omitting values since
# at that point the sum of the row exceeds its limit.
A_th = A_cs > B[:, None]
# then you can use the boolean array to create a new
# array where the appropriate values in the original
# array are set to zero to reduce the row sum.
A_nw = A * (1 - A_th)
output:
A_nw =
[[1 0 0]
[0 1 1]
[1 0 0]]
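As an aside (my addition, not part of the original answer): since the question mentions scaling to larger matrices and tensors, the same cumulative-sum trick appears to carry over by working along the last axis and broadcasting the limits. A rough sketch with made-up 3-D shapes:
import numpy as np

# a stack of two 3x3 binary matrices and per-row limits of matching shape (2, 3)
T = np.random.randint(0, 2, size=(2, 3, 3))
limits = np.array([[1, 2, 1],
                   [2, 1, 3]])
# cumulative sum along the last axis, thresholded against the broadcast limits
mask = np.cumsum(T, axis=-1) > limits[..., None]
T_capped = T * ~mask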
Unrelated note:
The following note is here to help OP better their dev-related search skills.
I can answer some questions instantaneously, but this was not one of them. I'm telling you this because I reached the answer through a simple Google search for "python find the i-th non zero element in each row", which led me to this post, which in turn led me very quickly to an answer. You don't have to try to be a better, more independent code writer. But if you want to, know that you can.

np.where() computes np.random.choice() only once - pandas

I have this dataframe:
import numpy as np
import pandas as pd

np.random.seed(0)
N = 10000
N_Seg = 100
df = pd.DataFrame({"Rut_Num": range(1, N + 1),
                   "Segmento": np.random.choice(
                       ["Afluente", "Afluente", "Premium", "Preferente", "Preferente", "Preferente", "Preferente", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico"], N),
                   "If_Seguro": np.random.choice([0, 1, 1], N)})
df.head()
Rut_Num Segmento If_Seguro
0 1 Clásico 1
1 2 Preferente 0
2 3 Afluente 0
3 4 Preferente 0
4 5 Clásico 1
When the column If_Seguro is 1, I need a random number between 1 and N_Seg+1; if it's 0, I need a 0:
np.random.seed()
df.loc[:,"id_Seguro"] = np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0)
df["id_Seguro"].value_counts()
You can see that the np.where() true branch gives the same number for all of the ones, when I need a different random number for each 1 in If_Seguro.
Besides, why does np.where() compute np.random.choice() only once for the whole column rather than once for each row?
The expression np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0) shows what is in my opinion a frequently encountered, but generally undesirable use of where. The solution will also answer your question as to why only one value is being generated.
np.where does not compute much. It just selects values based on a mask from a pair of existing arrays. Normal python semantics don't change here. You are passing in the result of a function call, not the function itself, so it's the value that is used. This means that you need to compute np.random.choice(...) for all of the rows of df, not just the ones where df["If_Seguro"] == 1.
df["If_Seguro"] is a mask, and numpy provides you with some tools for worrying with masks. For example, the actual number of elements you want to generate is
np.count_nonzero(df["If_Seguro"])
The row locations where you want to insert those values is given by the mask itself. Both numpy and pandas allow you to index with a boolean mask directly. np.where is just an extra layer of inefficiency in many cases.
Finally, to generate N samples from an existing sequence, do either:
np.random.choice(range(1, N_Seg + 1), size=N, replace=True)
replace=True allows the samples to repeat, as your original call to np.where likely intended. A much better way to do the same thing does not involve an explicit sequence object:
np.random.randint(1, N_Seg + 1, N)
In the proposed solution, the size will be the number of masked elements, whereas in your original code it should have been N.
So finally we have:
mask = df["If_Seguro"]
df.loc[mask, "id_Seguro"] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
If id_Seguro is not already zeroed out to start with, you can do one of a couple of things. Adding on to the previous:
df.loc[~mask, "id_Seguro"] = 0
Or generating a new array from scratch:
mask = df["If_Seguro"]
result = np.zeros(N)
result[mask] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
df["id_Seguro"] = result

Generate numpy array with duplicate rate

Here is my problem: I have to generate some synthetic data (7-8 columns or so), correlated with each other (using the Pearson coefficient). I can do this easily, but then I have to insert a percentage of duplicates into each column (yes, the Pearson coefficient will be lower), different for each column.
The problem is that I don't want to insert those duplicates by hand, because in my case it would be like cheating.
Does anyone know how to generate correlated data that already contains duplicates? I've searched, but the questions are usually about dropping or avoiding duplicates.
Language: python3
To generate correlated data I'm using this simple code: Generating correlated data
Try something like this:
indices = np.random.randint(0, array.shape[0], size=int(np.ceil(percentage * array.shape[0])))
# ndarrays have no append method, so stack the selected rows onto the original array
array = np.vstack([array, array[indices]])
Here I make the assumption that your data is stored in array, which is an ndarray where each row contains your 7-8 columns of data.
The above code creates an array of random row indices, whose entries (rows) are selected and appended to the array again.
I found the solution.
I'm posting the code; it might be helpful for someone.
import numpy as np
import pandas as pd

# these are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
# this array represents a column of the covariance matrix (I want correlated data,
# so I randomly choose numbers between 0.8 and 0.95)
# I added 7 more columns, with varying ranges of values (all above 0.7)
attr1 = np.random.uniform(0.8, .95, size=(8, 1))
# attr2, ..., attr8 are built like attr1
# corr_mat is the matrix obtained by joining the columns
corr_mat = np.column_stack((attr1, attr2, attr3, attr4, attr5, attr6, attr7, attr8))
from statsmodels.stats.correlation_tools import cov_nearest
# using this function I find the nearest covariance matrix to my matrix,
# to be sure that it is positive definite
a = cov_nearest(corr_mat)
from scipy.linalg import cholesky
upper_chol = cholesky(a)
# finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
# ans now holds randomly correlated data (high correlation, but customizable)
# next I create a pandas DataFrame from the ans values
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])
# the last step is to truncate the float values of ans by a varying amount,
# so I get duplicates in varying percentages
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
# the float values of ans have 16 decimals, so I randomly choose an int
# between 5 and 12 and use it to truncate each value
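For reference (my addition, not part of the original post), the per-column duplicate percentages reported below can be measured on the truncated values with Series.duplicated, reusing df and a from the snippet above:
for i, col in enumerate(df.columns):
    # fraction of values that repeat an earlier value in the column, as a percentage
    dup_rate = pd.Series(a.T[i]).duplicated().mean() * 100
    print(f"duplicate rate attribute: {col} = {dup_rate}")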
Finally, these are my duplicate percentages for each column:
duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252

How to sum over columns with some weight in a csr matrix in python

If I have a large csr_matrix A and I want to sum over its columns, simply
A.sum(axis=0)
does this for me, right? Are the corresponding axis values 1 -> rows, 0 -> columns?
I get stuck when I want to sum over the columns with weights that are specified in a list, e.g. [1 2 3 4 5 4 3 ... 4 2 5], with the same length as the number of rows in the csr_matrix A. To be more clear, I want the inner product of each column vector with this weight vector. How can I achieve this with Python?
This is a part of my code:
uniFeature = csr_matrix(uniFeature)
[I, J] = uniFeature.shape
sumfreq = uniFeature.sum(axis=0)
sumratings = []
for j in range(J):
    column = uniFeature.getcol(j)
    column = column.toarray()
    sumtemp = np.dot(ratings, column)
    sumratings.append(sumtemp)
sumfreq = sumfreq.toarray()
average = np.true_divide(sumratings, sumfreq)
(NumPy is imported as np.) There is a weight vector "ratings"; the program is supposed to output the average rating for each column of the matrix "uniFeature".
I experimented with dotting column = uniFeature.getcol(j) directly with ratings (which is a list), but I get an error saying the formats do not agree. It works after column.toarray() and then dotting with ratings. But doesn't converting each column back to dense form defeat the point of having a sparse matrix, and wouldn't it be very slow? I ran the above code and it is too slow to show results. I guess there should be a way to dot the vector "ratings" with each column of the sparse matrix efficiently.
Thanks in advance!
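A sketch of one way to do this without converting any column to dense form (my addition, assuming ratings is a 1-D weight vector of length I): scipy sparse matrices support .dot with dense NumPy vectors, so A.T.dot(ratings) yields the inner product of ratings with every column at once.
import numpy as np

ratings = np.asarray(ratings)                  # weight vector, shape (I,)
# inner product of ratings with every column of uniFeature, returned as a dense
# 1-D array of length J while the matrix itself stays sparse
sumratings = uniFeature.T.dot(ratings)
sumfreq = np.asarray(uniFeature.sum(axis=0)).ravel()
average = np.true_divide(sumratings, sumfreq)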
