I have a numpy series of numbers:
arr = np.array([1147.8, 1067.2, 957.6, 826.4])
And a pandas DF, with two columns, 'right' and 'left', that describe a range, where each range is contained in the next one in the DF:
right left
0 1090 1159.5
1 1080 1169.5
2 1057.5 1191.99
For each number in arr, I would like to get the index of the first range containing it. For the first number (1147.8), it's going to be 0, since it's in the range (1090, 1159.5). For the second one (1067.2), it's going to be 2, since 1067.2 is in (1057.5, 1191.99) but not in (1080, 1169.5) (nor, of course, in any of the previous ranges).
I could iterate the DF for each number in arr, but I'm looking for a smarter way.
Thanks
Do a full cross-product between arr and df, then filter, then select the first matching range per number. That's OK for small amounts of data. Ideally, you would do it all at once for all 2000 arrs. Even with around 2 million rows in the DataFrame after .merge(df_arr, how='cross'), the approach still works in that case.
df_arr = pd.DataFrame({"arr": arr,
                       "id_arr": range(len(arr))})

(df.reset_index()
   .merge(df_arr, how='cross')
   .query("right < arr < left")
   .groupby("id_arr")
   .first())
Produces:
index right left arr
id_arr
0 0 1090.0 1159.50 1147.8
1 2 1057.5 1191.99 1067.2
Where index is the index of the tightest range.
The id_arr is used for grouping in case you have duplicate values in arr and you expect duplicate values in the results. If that's not relevant, one could also group by arr directly.
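If you only need the matched range index per number as a plain array, a small follow-up sketch (res is just a hypothetical name for the grouped result above):
res = (df.reset_index()
         .merge(df_arr, how='cross')
         .query("right < arr < left")
         .groupby("id_arr")
         .first())

first_range_idx = res["index"].to_numpy()   # e.g. array([0, 2]) for the first two numbers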
I have a pandas DataFrame indexed by ID and sorted by value. I want to create a sample of n=20000 pairs (40000 rows in total), where the 2 rows of each pair are consecutive/paired. I want to perform additional calculations on these 2 consecutive/paired rows.
e.g. if I say sample size n=2, I want to randomly pick pairs and find the difference in distance within each of the following picks.
Additional condition: the value difference within a pair can't exceed 4000.
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
Then the same for the next pick, and so on:
cg20826792 29425 0.657369
cg33045430 29407 1.708055
Sample original dataframe
index value distance
cg13869341 15865 1.635450
cg14008030 18827 4.161332
cg12045430 29407 0.708055
cg20826792 29425 0.657369
cg33045430 69407 1.708055
cg40826792 59425 0.857369
cg47454306 88407 0.708055
cg60826792 96425 2.857369
I tried using df_sample = df.sample(n=20000), then I got a bit lost trying to figure out how to get the next row for each value in df_sample.
The original shape is (480136, 14).
If it doesn't matter to always have (even, odd) positional pairs (which reduces randomness a bit), you can select n even-positioned rows and take the next (odd-positioned) row for each:
N = 20000
# get the indices of N random even-positioned rows
idx = df.loc[::2].sample(n=N).index
# create a boolean mask to identify those rows
m = df.index.to_series().isin(idx)
# select those rows OR the next ones
df_sample = df.loc[m | m.shift()]
Example output on the toy DataFrame (N=3):
index value distance
2 cg12045430 29407 0.708055
3 cg20826792 29425 0.657369
4 cg33045430 69407 1.708055
5 cg40826792 59425 0.857369
6 cg47454306 88407 0.708055
7 cg60826792 96425 2.857369
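As a possible follow-up for the "additional calculations" mentioned in the question, a minimal sketch that computes the within-pair value difference on the sampled rows (assuming df_sample from above, where every two consecutive rows form a pair):
import numpy as np

pair_id = np.arange(len(df_sample)) // 2                          # 0, 0, 1, 1, 2, 2, ...
pair_diff = df_sample.groupby(pair_id)["value"].diff().dropna()   # one difference per pair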
increasing randomness
The drawback of the above approach is the bias towards always having (even, odd) positional pairs. To overcome this, we can first remove a random fraction of the DataFrame: small enough to still leave plenty of rows to pick from, but large enough to shift the (even, odd) pairs to (odd, even) ones in many locations. The fraction of rows to remove should be tuned based on the initial size and the sample size; I used 20-30% here:
N = 20000
frac = 0.2

idx = (df
       .drop(df.sample(frac=frac).index)
       .loc[::2].sample(n=N)
       .index
       )

m = df.index.to_series().isin(idx)
df_sample = df.loc[m | m.shift()]

# check:
# len(df_sample)
# 40000
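Note that the question's additional condition (within-pair value difference can't exceed 4000) is not enforced by either variant above; a hedged post-filter sketch (this drops pairs, so you may need to resample to get back to exactly N pairs):
import numpy as np

values = df_sample["value"].to_numpy()
pair_ok = np.abs(values[1::2] - values[0::2]) <= 4000   # one flag per pair
df_sample = df_sample[np.repeat(pair_ok, 2)]            # expand the flags back to row level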
Here's my first attempt. (I only just noticed your additional constraint, and I'm not sure whether you need the precise number of samples; if so, you'll have to do some fudging after the line first_indices = first_indices[mask] below.)
import random
import numpy as np

# Temporarily reset the index so we have something we can add one to.
df = df.reset_index(level=0)

# Choose the first index of each pair.
# Use random.sample if you don't want repeats,
# or random.choices if you don't mind them.
# The code below does allow overlapping pairs such as (1, 2) and (2, 3).
first_indices = np.array(random.sample(sorted(df.index[:-1]), 4))

# Keep only those indices where the diff with the next row down is small enough
# (the "value difference can't exceed 4000" condition).
mask = [abs(df.loc[i, "value"] - df.loc[i + 1, "value"]) <= 4000
        for i in first_indices]
first_indices = first_indices[mask]

# Interleave this array with the same numbers, plus 1.
c = np.empty((first_indices.size * 2,), dtype=first_indices.dtype)
c[0::2] = first_indices
c[1::2] = first_indices + 1

# Filter.
df_sample = df[df.index.isin(c)]

# Restore the original index if required.
df = df.set_index("index")
Hope that helps. Regarding the bit where I use a mask to filter first_indices, this answer might be of help if you need faster alternatives: Filtering (reducing) a NumPy Array
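As an aside, the list-comprehension filter could likely be vectorized; a sketch under the assumption that first_indices holds positional row numbers (which it does after the reset_index above):
import numpy as np

values = df["value"].to_numpy()
# same pair-difference condition, computed in one vectorized step
mask = np.abs(values[first_indices + 1] - values[first_indices]) <= 4000
first_indices = first_indices[mask]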
Given data like so:
Symbol    One    Two
1       28.75  25.10
2       29.00  25.15
3       29.10  25.00
I want to drop any column whose values are not in ascending order (though I want to allow for gaps) across all rows. In this case, I want to drop column 'Two'. I tried the following code with no luck:
df.drop(df.columns[df.all(x <= y for x,y in zip(df, df[1:]))])
Thanks
Drop those columns that contain at least one (any) negative value (lt(0)) when their values are differenced by one lag (diff(1)), after NaNs are dropped (dropna):
columns_to_drop = [col for col in df.columns if df[col].diff(1).dropna().lt(0).any()]
df.drop(columns=columns_to_drop)
Symbol One
0 1 28.75
1 2 29.00
2 3 29.10
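As a hedged aside, the same idea can be written without the Python-level loop over columns (assuming all columns are numeric, as in the example):
# keep only the columns whose lag-1 differences are never negative
# (the NaN produced for the first row is ignored by lt(0))
df.loc[:, ~df.diff().lt(0).any()]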
An expression that works with gaps (NaN)
A.loc[:, ~(A.iloc[1:, :].reset_index() < A.iloc[:-1, :].reset_index()).any()]
Without gaps it would be equivalent to
A.loc[:, (A.iloc[1:, :].reset_index() >= A.iloc[:-1, :].reset_index()).all()]
This avoids Python-level loops, to take better advantage of the framework for bigger dataframes.
A.iloc[1:, :] returns the dataframe without the first row.
A.iloc[:-1, :] returns the dataframe without the last row.
Slices of a dataframe keep the original row indices, so the two slices have different indices; reset_index creates a fresh counting index [0, 1, ...], making the two sides of the inequality comparable row by row. You can pass drop=True if you want to discard the previous index instead of keeping it as a column.
.any() (implicitly with axis=0) checks, for every column, whether any value is True; if so, some value was smaller than its predecessor.
A.loc[:, mask] selects the columns where mask is True and drops the columns where it is False.
The logic can be read as: no value is smaller than its predecessor, or equivalently, every value is greater than or equal to its predecessor.
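A self-contained check of that expression on the question's sample data (A here is assumed to be the DataFrame from the question):
import pandas as pd

A = pd.DataFrame({'Symbol': [1, 2, 3],
                  'One': [28.75, 29.00, 29.10],
                  'Two': [25.10, 25.15, 25.00]})

later = A.iloc[1:, :].reset_index(drop=True)     # rows 1..n-1
earlier = A.iloc[:-1, :].reset_index(drop=True)  # rows 0..n-2
keep = ~(later < earlier).any()                  # no value smaller than its predecessor
print(A.loc[:, keep])                            # keeps Symbol and One, drops Two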
Check out the code below; the only real logic is:
map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns)
import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        'Symbol': [1, 2, 3],
        'One': [28.75, 29.00, 29.10],
        'Two': [25.10, 25.15, 25.10],
    }
)

print(df.loc[:, map(lambda i: list(df[i]) == sorted(list(df[i])), df.columns)])
I have this dataframe:
np.random.seed(0)
N = 10000
N_Seg = 100
df = pd.DataFrame({"Rut_Num": range(1,N+1),
"Segmento": np.random.choice(
["Afluente", "Afluente","Premium", "Preferente", "Preferente", "Preferente", "Preferente", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico", "Clásico"], N),
"If_Seguro": np.random.choice([0,1,1], N)})
df.head()
Rut_Num Segmento If_Seguro
0 1 Clásico 1
1 2 Preferente 0
2 3 Afluente 0
3 4 Preferente 0
4 5 Clásico 1
When the column If_Seguro is 1, I need a random number between 1 and N_Seg+1; if it's 0, I need a 0:
np.random.seed()
df.loc[:,"id_Seguro"] = np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0)
df["id_Seguro"].value_counts()
You can see that the np.where() true branch gives the same number for every 1, when I need a different random number for each 1 in If_Seguro.
Besides, why does np.where() compute np.random.choice() only once for the whole column instead of evaluating it for each row?
The expression np.where(df["If_Seguro"] == 1, np.random.choice(range(1,N_Seg+1),1),0) shows what is in my opinion a frequently encountered, but generally undesirable use of where. The solution will also answer your question as to why only one value is being generated.
np.where does not compute much. It just selects values based on a mask from a pair of existing arrays. Normal python semantics don't change here. You are passing in the result of a function call, not the function itself, so it's the value that is used. This means that you need to compute np.random.choice(...) for all of the rows of df, not just the ones where df["If_Seguro"] == 1.
df["If_Seguro"] is a mask, and numpy provides you with some tools for worrying with masks. For example, the actual number of elements you want to generate is
np.count_nonzero(df["If_Seguro"])
The row locations where you want to insert those values is given by the mask itself. Both numpy and pandas allow you to index with a boolean mask directly. np.where is just an extra layer of inefficiency in many cases.
Finally, to generate N samples from an existing sequence, do either:
np.random.choice(range(1, N_Seg + 1), size=N, replace=True)
replace=True allows the samples to repeat, as your original call to np.where likely intended. A much better way to do the same thing does not involve an explicit sequence object:
np.random.randint(1, N_Seg + 1, N)
In the proposed solution, the size passed to the random generator will be the number of masked elements, whereas in your original code it should have been N.
So finally we have:
mask = df["If_Seguro"]
df.loc[mask, "id_Seguro"] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
If id_Seguro is not already zeroed out to start with, you can do one of a couple of things. Adding on to the previous:
df.loc[~mask, "id_Seguro"] = 0
Or generating a new array from scratch:
mask = df["If_Seguro"]
result = np.zeros(N)
result[mask] = np.random.randint(1, 1 + N_Seg, np.count_nonzero(mask))
df["id_Seguro"] = result
I have a Pandas dataframe with two columns, x and y, that correspond to a large signal. It is about 3 million rows in size.
(Plot: the wavelength signal from the dataframe.)
I am trying to isolate the peaks from the signal. After using scipy, I got a 1D Python list corresponding to the indexes of the peaks. However, they are not the actual x-values of the signal, but just the index of their corresponding row:
from scipy.signal import find_peaks
peaks, _ = find_peaks(y, height=(None, peakline))
So, I decided I would just filter the original dataframe by setting all values in its y column to NaN unless they were on an index found in the peak list. I did this iteratively, however, since it is 3000000 rows, it is extremely slow:
peak_index = 0
for data_index in list(data.index):
    if peak_index >= len(peaks) or data_index != peaks[peak_index]:
        data.iloc[data_index, 1] = float('NaN')
    else:
        peak_index += 1
Does anyone know what a faster method of filtering a Pandas dataframe might be?
Looping is extremely inefficient in most cases when it comes to pandas. Assuming you just need a filtered DataFrame that contains the values of both the x and y columns only where y is a peak, you may use the following piece of code:
df.iloc[peaks]
Alternatively, if you want to keep the original DataFrame with its y column retaining the peak values and holding NaN everywhere else, you can build a positional boolean mask and use where:
mask = np.zeros(len(df), dtype=bool)
mask[peaks] = True
df.y = df.y.where(mask)
Finally, since you seem to care about just the x values of the peaks, you might just rework the first piece in the following way:
df.iloc[peaks].x
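For reference, a minimal end-to-end sketch on synthetic data; the x and y column names and the peakline bound follow the question, while the signal itself is made up here:
import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# synthetic signal standing in for the 3-million-row data
x = np.linspace(0, 100, 100_000)
y = np.sin(x) + np.random.normal(0, 0.05, x.size)
df = pd.DataFrame({"x": x, "y": y})

peakline = 2.0  # placeholder for the question's height bound
peaks, _ = find_peaks(df["y"], height=(None, peakline))

peak_rows = df.iloc[peaks]   # rows at the peak positions
peak_x = peak_rows["x"]      # the actual x-values of the peaks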
I have three DataFrames that are all the same shape ~(1,000, 10,000).
original has ~20-100 non-zero values per row - very sparse
input is a copy of original, with 10 random non-zero values per row changed to zero
output is populated completely with non-zero values
I am now attempting to compare original and output only in the positions where input and original are different (i.e. just in the 10 randomly chosen positions).
Firstly, I create a df of only these elements of original with everything else set to zero:
maskedOriginal = original.where(original != input, other=0)
This is created in seconds. I then attempt to do the same for output:
maskedOutput = output.where(original != input, other=0)
However, since this is now working with 3 DataFrames, it is far too slow - I haven't even got a result after a couple of minutes. Is there any more suitable way to do this?
Use numpy.where with the DataFrame constructor:
arr = original.values
maskedOriginal = pd.DataFrame(np.where(arr != input, arr, 0),
                              index=original.index,
                              columns=original.columns)
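The same pattern should apply to the step that was actually slow in the question (building maskedOutput); a hedged sketch assuming input and output are the DataFrames described there:
arr_in = input.values
arr_out = output.values

maskedOutput = pd.DataFrame(np.where(arr != arr_in, arr_out, 0),
                            index=output.index,
                            columns=output.columns)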