I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs, i.e. 40000 rows in total, where the 2 rows in each pair are consecutive. I then want to perform additional calculations on each pair of consecutive rows.
For example, if I set the sample size to n=2, I want to randomly pick pairs and find the difference in distance within each pick:
Additional condition: the value difference within a pair can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the distance of the following pick, etc.
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), but then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter that every pair starts at an even position (0, 2, 4, ...), which reduces randomness a bit, you can select n even-positioned rows and take the next row for each:
N = 20000
# get the indices of N random rows at even positions (0, 2, 4, ...)
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify the rows
m = df.index.to_series().isin(idx)
# select those OR the next ones
df_sample = df.loc[m|m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
increasing randomness
The drawback of the above approach is that the pairs are biased to always start at an even position. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave enough rows to pick from, but large enough to shift many pair boundaries from even to odd starting positions. The fraction of rows to remove should be tuned depending on the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2
idx = (df
       .drop(df.sample(frac=frac).index)
       .loc[::2].sample(n=N)
       .index
      )
m = df.index.to_series().isin(idx)
df_sample = df.loc[m|m.shift()]
# check:
# len(df_sample)
# 40000
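Note that neither snippet above enforces the additional condition from the question (value difference within a pair not exceeding 4000). A minimal sketch of one way to drop violating pairs afterwards, assuming df_sample was built as above and therefore holds the rows in (first, second) pair order:
import numpy as np
# reshape the sampled values into one row per pair and keep only the
# pairs whose within-pair value difference is at most 4000
pairs = df_sample["value"].to_numpy().reshape(-1, 2)
ok = np.abs(pairs[:, 1] - pairs[:, 0]) <= 4000
df_sample = df_sample[np.repeat(ok, 2)]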
Here's my first attempt. (I only just noticed your additional constraint, and I'm not sure whether you need an exact number of samples; if so, you'll have to do some fudging after the filtering step below.)
import random
import numpy as np
# Temporarily reset the index so we have an integer index we can add one to.
df = df.reset_index(level=0)
# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1, 2) and (2, 3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))
# Keep only the indices whose value difference with the next row down
# does not exceed 4000.
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000
        for i in first_indices]
first_indices = first_indices[mask]
# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1
# Filter the DataFrame down to the selected pairs.
df_sample = df[df.index.isin(c)]
# Restore original index if required.
df = df.set_index("index")
Hope that helps. Regarding the bit where I use a mask to filter the indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
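For what it's worth, here is a minimal vectorized sketch of that filtering idea (my own, not taken from the linked answer), assuming df still has a plain 0..n-1 integer index as after the reset_index above:
import numpy as np
# absolute value difference between each row and the next one
diffs = np.abs(np.diff(df["value"].to_numpy()))
# positions i whose pair (i, i+1) satisfies the <= 4000 condition
valid_starts = np.flatnonzero(diffs <= 4000)
# sample the first index of each pair from the valid positions only
first_indices = np.random.choice(valid_starts, size=4, replace=False)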
I have a numpy array of numbers:
arr = np.array([1147.8, 1067.2, 957.6, 826.4])
And a pandas DF with two columns, 'right' and 'left', that describe a range, where each range is contained in the next one in the DF:
right left
0 1090 1159.5
1 1080 1169.5
2 1057.5 1191.99
For each number in arr, I would like to get the index of the first range containing it. For the first number (1147.8), it's going to be 0, since it's in the range (1090, 1159.5). For the second one (1067.2), it's going to be 2, since it's in (1057.5, 1191.99) but not in (1080, 1169.5) (and, of course, not in any of the previous ranges).
I could iterate the DF for each number in arr, but I'm looking for a smarter way.
Thanks
Build the full cross-product between arr and df, then filter, then select the first matching range. That's fine for small amounts of data. Ideally, you would do it all at once for all 2000 values in arr; with around 2 million rows in the DataFrame after .merge(df_arr, how='cross'), the approach would still work in that case.
df_arr = pd.DataFrame({"arr": arr,
                       "id_arr": range(len(arr))})
(df.reset_index()
   .merge(df_arr, how='cross')
   .query("right < arr < left")
   .groupby("id_arr")
   .first())
Produces:
index right left arr
id_arr
0 0 1090.0 1159.50 1147.8
1 2 1057.5 1191.99 1067.2
Where index is the index of the tightest range.
The id_arr is used for grouping in case you have duplicate values in arr and you expect duplicate values in the results. If that's not relevant, one could also group by arr directly.
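For completeness, a sketch of that variant (grouping by arr directly, under the assumption that the values in arr are distinct):
# group by the value itself; duplicate values in arr would collapse into one group
(df.reset_index()
   .merge(pd.DataFrame({"arr": arr}), how='cross')
   .query("right < arr < left")
   .groupby("arr")
   .first())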
I have a 3*3 matrix with 1s and 0s, A = [[1,0,1],[0,1,1],[1,0,0]], and an array indicating the limit on the row sums, B = [1,2,1]. I want to find the rows of A whose sum exceeds the corresponding value in B and set non-zero elements of A to zero so that each row sum matches B. Finding the rows of A that exceed the sum is easy; masking the elements to adjust the sum is what I need help with. How can this be achieved? (I want to scale it to larger matrices and tensors.)
I would do something like this:
import numpy as np
A = np.array([[1,0,1],[0,1,1],[1,0,0]])
B = np.array([1,2,1])
# a cumulative sum of each row will tell you how many
# ones were in that row up to each point.
A_cs = np.cumsum(A, axis = 1)
# thresholding against the sum vector will let
# you know where you should start omitting values, since
# at that point the sum of the row exceeds its limit.
A_th = A_cs > B[:, None]
# then you can use the boolean array to create a new
# array where the appropriate values in the original
# array are set to zero to reduce the row sum.
A_nw = A * (1 - A_th)
output:
A_nw =
[[1 0 0]
[0 1 1]
[1 0 0]]
Unrelated note:
The following note is here to help OP better their dev-related search skills.
I can answer some questions instantaneously, but this was not one of them. I'm telling you this because I reached the answer through a simple Google search for "python find the i th non zero element in each row", which led me to this post, which in turn led me very quickly to an answer. You don't have to try to be a better, more independent code writer. But if you want to, know that you can.
I have two matrices with the same number of columns but a different number of rows, one is a lot larger.
matA = [[1,0,1],[0,0,0],[1,1,0]], matB = [[0,0,0],[1,0,1],[0,0,0],[1,1,1],[1,1,0]]
Both of them are numpy matrices
I am trying to find how many times each row of matA appears in matB and put that into an array, so the array in this case will become arr = [1,2,1], because the first row of matA appears one time in matB, the second row appears two times, and the last row only one time.
Here is a solution:
import numpy as np
A = np.array([[1,0,1],[0,0,0],[1,1,0]])
B = np.array([[0,0,0],[1,0,1],[0,0,0],[1,1,1],[1,1,0]])
# stack the rows, A has to be first
combined = np.concatenate((A, B), axis=0) #or np.vstack
unique, unique_indices, unique_counts = np.unique(combined,
                                                  return_index=True,
                                                  return_counts=True,
                                                  axis=0)
print(unique)
print(unique_indices)
print(unique_counts)
# now we need to derive your desired result from the unique
# indices and counts
# we know the number of rows in A
n_rows_in_A = A.shape[0]
# so any entry of unique_indices smaller than n_rows_in_A
# corresponds to a row that appears first (or only) in A
indices_A = np.nonzero(unique_indices < n_rows_in_A)[0] #first
#indices_A1 = np.argwhere(unique_indices < n_rows_in_A)
print(indices_A)
#print(indices_A1)
unique_indices_A = unique_indices[indices_A]
unique_counts_A = unique_counts[indices_A]
print(unique_indices_A)
print(unique_counts_A)
# now we need to subtract one from each count:
# that's the one occurrence in A that we are not interested in.
unique_counts_A -= 1
print(unique_indices_A)
print(unique_counts_A)
# this is nearly the result we want
# now we need to sort it and account for rows that are not
# appearing in A but in B
# will do that later...
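In case it is useful, here is a minimal sketch of that last step, assuming the rows of A are themselves unique:
# because A comes first in `combined`, the first-occurrence index of any row
# that exists in A is simply its row number in A, so sorting by
# unique_indices_A restores A's original row order
order = np.argsort(unique_indices_A)
arr = unique_counts_A[order]
print(arr)  # [1 2 1] for the example above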
I have this dataframe:
np.random.seed(0)
N = 10000
N_Seg = 100
df = pd.DataFrame({"Rut_Num": range(1,N+1),
"Segmento": np.random.choice(
["Afluente", "Afluente","Premium", "Preferente", "Preferente", "Preferente", "Preferente", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico"], N),
"If_Seguro": np.random.choice([0,1,1], N)})
df.head()
Rut_Num Segmento If_Seguro
0 1 Clásico 1
1 2 Preferente 0
2 3 Afluente 0
3 4 Preferente 0
4 5 Clásico 1
When the column If_Seguro is 1, I need a random number between 1 and N_Seg+1; if it's 0, I need a 0:
np.random.seed()
df.loc[:,"id_Seguro"] = np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0)
df["id_Seguro"].value_counts()
You can see that the np.where() true branch gives the same number for all of the ones, while I need a different random number for each 1 in If_Seguro.
Besides, why does np.where() compute np.random.choice() only once for the whole column instead of computing it for each row?
The expression np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0) shows what is in my opinion a frequently encountered, but generally undesirable use of where. The solution will also answer your question as to why only one value is being generated.
np.where does not compute much. It just selects values, based on a mask, from a pair of existing arrays. Normal Python semantics don't change here: you are passing in the result of a function call, not the function itself, so it's the value that is used. This means that to use np.where this way, you would need to compute np.random.choice(...) for all of the rows of df, not just the ones where df["If_Seguro"] == 1.
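A tiny illustration (my own, reusing the question's N_Seg) of why a single value shows up everywhere:
# the second argument is evaluated once, before np.where ever runs,
# so the single drawn value is broadcast to every row where the condition is True
vals = np.random.choice(range(1, N_Seg + 1), 1)   # one draw, e.g. array([37])
out = np.where(df["If_Seguro"] == 1, vals, 0)     # that one value everywhere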
df["If_Seguro"] is a mask, and numpy provides you with some tools for worrying with masks. For example, the actual number of elements you want to generate is
np.count_nonzero(df["If_Seguro"])
The row locations where you want to insert those values is given by the mask itself. Both numpy and pandas allow you to index with a boolean mask directly. np.where is just an extra layer of inefficiency in many cases.
Finally, to generate N samples from an existing sequence, do either:
np.random.choice(range(1, N_Seg + 1), size=N, replace=True)
replace=True allows the samples to repeat, as your original call to np.where likely intended. A much better way to do the same thing does not involve an explicit sequence object:
np.random.randint(1, N_Seg + 1, N)
In the proposed solution, the number of samples is the number of masked elements, whereas in your original code it should have been N.
So finally we have:
mask = df["If_Seguro"]
df.loc[mask, "id_Seguro"] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
If id_Seguro is not already zeroed out to start with, you can do one of a couple of things. Adding on to the previous:
df.loc[~mask, "id_Seguro"] = 0
Or generating a new array from scratch:
mask = df["If_Seguro"]
result = np.zeros(N)
result[mask] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
df["id_Seguro"] = result
Here is my problem: I have to generate some synthetic data (7/8 columns), correlated with each other (using the Pearson coefficient). I can do this easily, but next I have to insert a percentage of duplicates in each column (yes, the Pearson coefficient will be lower), different for each column.
The problem is that I don't want to insert those duplicates myself, because in my case it would be like cheating.
Does anyone know how to generate correlated data that already contains duplicates? I've searched, but the questions are usually about dropping or avoiding duplicates.
Language: python3
To generate correlated data I'm using this simple code: Generating correlated data
Try something like this :
indices = np.random.randint(0, array.shape[0],
                            size=int(np.ceil(percentage * array.shape[0])))
# an ndarray has no .append, so stack the selected rows onto the array instead
array = np.concatenate([array, array[indices]], axis=0)
Here I assume that your data is stored in array, which is an ndarray where each row contains your 7/8 columns of data.
The above code creates an array of random row indices, whose entries (rows) you select and append to the array again.
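A quick usage sketch (with made-up names), duplicating 10% of the rows of a (1000, 8) data matrix:
import numpy as np
array = np.random.normal(size=(1000, 8))   # stand-in for your correlated data
percentage = 0.10
indices = np.random.randint(0, array.shape[0],
                            size=int(np.ceil(percentage * array.shape[0])))
array = np.concatenate([array, array[indices]], axis=0)
print(array.shape)   # (1100, 8)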
I found the solution.
I'm posting the code; it might be helpful for someone.
#these are the data, generated randomly with a given shape
rnd = np.random.random(size=(10**7, 8))
#this array represents a column of the covariance matrix (I want correlated data, so I randomly choose values between 0.8 and 0.95)
#I added 7 more columns, with varying ranges of values (all above 0.7)
attr1 = np.random.uniform(0.8, .95, size = (8,1))
#attr2, attr3, ..., attr8 like attr1
#corr_mat is the matrix formed by stacking those columns
corr_mat = np.column_stack((attr1,attr2,attr3,attr4,attr5, attr6,attr7,attr8))
from statsmodels.stats.correlation_tools import cov_nearest
#using that function I find the nearest covariance matrix to my matrix,
#to make sure it's positive definite
a = cov_nearest(corr_mat)
from scipy.linalg import cholesky
upper_chol = cholesky(a)
# Finally, compute the matrix product of rnd and upper_chol
ans = rnd @ upper_chol
#ans now holds randomly correlated data (high correlation, but customizable)
#next I create a pandas DataFrame with the ans values
df = pd.DataFrame(ans, columns=['att1', 'att2', 'att3', 'att4',
                                'att5', 'att6', 'att7', 'att8'])
#the last step is to round the float values of ans to a varying number of
#decimal places, so I get duplicates in varying percentages
a = df.values
for i in range(8):
    trunc = np.random.randint(5, 12)
    print(trunc)
    a.T[i] = a.T[i].round(decimals=trunc)
#the float values of ans have 16 decimals, so I randomly choose an int
#between 5 and 12 and use it to round each column
Finally, these are my duplicate percentages for each column:
duplicate rate attribute: att1 = 5.159390000000002
duplicate rate attribute: att2 = 11.852260000000001
duplicate rate attribute: att3 = 12.036079999999998
duplicate rate attribute: att4 = 35.10611
duplicate rate attribute: att5 = 4.6471599999999995
duplicate rate attribute: att6 = 35.46553
duplicate rate attribute: att7 = 0.49115000000000464
duplicate rate attribute: att8 = 37.33252
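The post doesn't show how those rates were computed; here is a minimal sketch, assuming "duplicate rate" means the percentage of values in a column that are repeats of an earlier value in that column:
# count, per column of the rounded array, the fraction of values that are
# not unique, and express it as a percentage
for i, col in enumerate(df.columns):
    n = a.shape[0]
    dup_rate = 100 * (n - len(np.unique(a.T[i]))) / n
    print("duplicate rate attribute:", col, "=", dup_rate)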