I am trying to find the point in the tail of a kernel density estimate where the density has dropped to almost 0. My approach is to evaluate the kernel function on a timeline from -120 to 120 and compute the percentage change between consecutive kernel values. I can then apply an arbitrary rule: the first point that follows 10 consecutive negative changes and whose kernel value is almost 0 is declared the starting point of the end of the curve.
Illustrative example of the point of the kernel function I want to obtain:
In this case, the final value I would like to obtain is around 300.
My dataframe looks like this (these are not the same values as in the example above):
df
id event_time
1 2
1 3
1 3
1 5
1 9
1 10
2 1
2 1
2 2
2 2
2 5
2 5
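# my try
import numpy as np
from scipy import stats

def find_value(df):
    if df.shape[0] == 1:
        return df.iloc[0].event_time
    kernel = stats.gaussian_kde(df['event_time'])
    time = list(range(-120, 120))
    a = kernel(time)                 # density values (Y axis of the graph)
    b = np.diff(a) / a[:-1] * 100    # percentage change between consecutive values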
So far I have a, which represents the Y axis of the graph, and b, which represents the percentage change in Y. I computed these in order to implement the logic described at the beginning, but I don't know how to code it. Once the function is written, I was thinking of using a groupby and an apply.
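The rule described above (10 consecutive negative changes, then a value close to 0) could be sketched roughly as follows. The function name find_tail_start, the parameters run_length and eps, and the synthetic bell curve standing in for the kernel output are all assumptions made for illustration. The sign of np.diff is used instead of the percentage change: both have the same sign whenever the density is positive, and this avoids dividing by values near 0.

```python
import numpy as np

def find_tail_start(y, run_length=10, eps=1e-3):
    """Return the first index whose value is below eps and that ends a run
    of at least run_length consecutive decreases; None if there is none."""
    y = np.asarray(y, dtype=float)
    falling = np.diff(y) < 0      # falling[k] is True when y[k + 1] < y[k]
    run = 0
    for k, is_falling in enumerate(falling):
        run = run + 1 if is_falling else 0
        if run >= run_length and y[k + 1] < eps:
            return k + 1
    return None

# Synthetic stand-in for kernel(time): a bell curve centred on 0 (assumption).
grid = np.arange(-120, 120)
density = np.exp(-0.5 * (grid / 20.0) ** 2)

idx = find_tail_start(density)
tail_time = grid[idx]
```

For this synthetic curve the rule lands at t = 75. The left tail is ignored because the density is rising there, so the run counter stays at 0. The threshold eps is arbitrary and would need tuning for real data.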
To set up the question: I have a dataframe containing spots and their x,y positions. I want to iterate over each spot and check all other spots to see whether they are within a radius. I then want to store the number of spots within the radius in a new column of the dataframe. I would like to iterate over the index, as I have a decent understanding of how that works. I know that I am missing something simple, but I have not been able to find a solution that works for me yet. Thank you in advance!
radius = 3
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
spot_id x_pos y_pos
0 1 5 4
1 2 4 10
2 3 10 8
3 4 3 6
4 5 8 3
I then want to get something that looks like this
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
To do it in a vectorized way, you can use scipy.spatial.distance_matrix to compute the distance matrix, D, between all the N row/position vectors ('x_pos', 'y_pos'). D is an N x N matrix (2D numpy.ndarray) whose entry (i, j) is the Euclidean distance between the ith and jth rows/positions.
Then, check which positions are at a distance <= radius from each other (D <= radius), which gives you a boolean matrix. Finally, count the True values per position using sum(axis=0); because D is symmetric, summing along either axis gives the same result. You have to subtract 1 at the end, since the count includes the distance between each vector and itself (the diagonal entries).
import pandas as pd
from scipy.spatial import distance_matrix
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
radius = 3
pos = df[['x_pos','y_pos']]
df['spots_within_radius'] = (distance_matrix(pos, pos) <= radius).sum(axis=0) - 1
Output
>>> df
spot_id x_pos y_pos spots_within_radius
0 1 5 4 1
1 2 4 10 0
2 3 10 8 0
3 4 3 6 1
4 5 8 3 0
If you don't want to use scipy.spatial.distance_matrix, you can compute D yourself using numpy's broadcasting.
import numpy as np
pos = df[['x_pos','y_pos']].to_numpy()
D = np.sum((pos - pos[:, None])**2, axis=-1) ** 0.5
df['spots_within_radius'] = (D <= radius).sum(axis=0) - 1
I would suggest using a KD Tree to answer this kind of question. It's a data structure designed to efficiently search for nearby points, and for large numbers of points it's faster than computing a distance matrix. You can use scikit-learn to implement this.
The code
Here's how:
import sklearn.neighbors
import pandas as pd
df = pd.DataFrame({'spot_id':[1,2,3,4,5],'x_pos':[5,4,10,3,8],'y_pos':[4,10,8,6,3]})
def add_points_in_range_column_kd(df, radius):
    # Get positions as numpy array
    positions = df[['x_pos', 'y_pos']].to_numpy(dtype='float32')
    # Build KD Tree on those positions
    tree = sklearn.neighbors.KDTree(positions)
    # For each position, check how many points are in range.
    # Return a count, and not the actual points.
    return tree.query_radius(positions, r=radius, count_only=True) - 1
df['spots_within_radius'] = add_points_in_range_column_kd(df, 3)
The efficiency argument
Since a distance matrix needs to calculate distance between all points, it has a time complexity of O(N^2). In contrast, the time required to find all of the points inside the KD Tree is proportional to the depth of the tree times the number of points you need to find. On average, this is O(N log N). So this method will be more efficient for a large number of points.
Benchmarking
Theory is nice, but is it actually faster in practice?
I ran both a KD Tree method, and a distance matrix method, on dataframes of sizes ranging from N=10 to N=3000. I used the timeit module, running both methods in random order for 100 iterations for all point sizes. Here is a graph of the time it takes with each method:
For small numbers of points, the distance matrix method is faster. Once you have more than about 300 points to compare to each other, the KD Tree is faster. Note that both axes of this graph are logarithmic.
Full testing details can be found here.
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that wouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get variance for each different X value and I do this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include the NULL value of Y for X=3 when A=1 and A=3.
This is what my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every A should have X values 1, 2 and 3, and when there is no Y value for a certain X, it should count as 0. Could you help me with this? How should I change my TableA dataframe to be able to do this, or is there another way?
The desired output for TableA looks like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np
# three in your example. adjust as needed
a_choices = len(TableA['A'].unique())
def variance_with_missing(vals):
    # mean over all A choices, counting the missing entries as 0
    mean_with_missing = np.sum(vals) / a_choices
    ss_present = np.sum((vals - mean_with_missing)**2)
    # each missing entry contributes (0 - mean)**2
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)
TableA.groupby(['X']).agg({'Y': variance_with_missing})
The approach of the solution below is to append the missing (A, X) combinations with Y=0. It's a little messy, but I hope it helps.
import pandas as pd
TableA = pd.DataFrame({'A':[1,1,2,2,2,3,3],
                       'X':[1,2,1,2,3,1,2],
                       'Y':[30,20,15,20,20,30,35]})
TableA['A'] = TableA['A'].astype(int)
#### Create rows for the non-existing combinations and fill Y with 0 ####
for i in range(1, TableA.X.max()+1):
    for j in TableA.A.unique():
        if TableA[(TableA.X==i) & (TableA.A==j)].empty:
            # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
            TableA = pd.concat([TableA, pd.DataFrame({'A':[j],'X':[i],'Y':[0]})], ignore_index=True)
TableA.groupby('X').agg({'Y':'var'})
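Another way to fill in the missing combinations, not taken from the answers above, is to reshape the table so that every A gets a column for every X and then fill the gaps with 0. This is only a sketch; the names wide and variance_by_x are made up for illustration, and it assumes each (A, X) pair occurs at most once, since pivot raises an error on duplicates.

```python
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})

# One row per A, one column per X; missing (A, X) pairs become NaN, then 0.
wide = TableA.pivot(index='A', columns='X', values='Y').fillna(0)

# Sample variance (ddof=1, the pandas default) of each X column over all A.
variance_by_x = wide.var()
print(variance_by_x)
```

On the example data this reproduces the desired values of 75, 75 and 133.33 for X = 1, 2 and 3.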
Is there a way to produce the following output using scipy.signal instead of loops?
import pandas as pd
df_in = pd.DataFrame({'Generated':[13,8,7,6],'Consume':[8,10,20,5]})
print(df_in)
Generated Consume
0 13 8
1 8 10
2 7 20
3 6 5
df_in['balance'] = [5,3,0,1]
Here 13 - 8 gives a balance of 5. The 5 is carried to the next line, and 5 + 8 - 10 yields a balance of 3.
The 3 is carried to the next line, where 3 + 7 - 20 yields a negative number, but you can't carry a negative balance, so the balance is 0.
So, on the next line, 0 carry + 6 - 5 yields a balance of 1.
print(df_in)
Expected output:
Generated Consume balance
0 13 8 5
1 8 10 3
2 7 20 0
3 6 5 1
If it weren't for the requirement to carry the balance only when it is positive, you could use an accumulator on the difference. This accumulator can be implemented using lfilter, obtaining the b and a parameters from the recurrence equation y[n] = y[n-1] + x[n]:
import scipy.signal
x = df_in['Generated'] - df_in['Consume']
df_in['balance'] = scipy.signal.lfilter([1], [1,-1], x)
Unfortunately adding the carry only if the balance stays positive makes the process non-linear which scipy.signal.lfilter is not made to handle. At this point you'd have to resort to using a loop to handle the special case.
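A minimal sketch of such a loop, assuming the carry is simply clamped at 0 after each row (the helper name clamped_balance is made up for illustration):

```python
import pandas as pd

df_in = pd.DataFrame({'Generated': [13, 8, 7, 6], 'Consume': [8, 10, 20, 5]})

def clamped_balance(generated, consumed):
    """Running balance that is reset to 0 whenever it would go negative."""
    balance = []
    carry = 0
    for g, c in zip(generated, consumed):
        carry = max(carry + g - c, 0)   # a negative balance is not carried
        balance.append(carry)
    return balance

df_in['balance'] = clamped_balance(df_in['Generated'], df_in['Consume'])
print(df_in)
```

On the example data this gives the balance column 5, 3, 0, 1 from the question.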
I have a very big dataframe (20,000,000+ rows) that contains a column called 'sequence', amongst others.
The 'sequence' column is calculated from a time series by applying a few conditional statements. The value "2" flags the start of a sequence, the value "3" flags the end of a sequence, the value "1" flags a datapoint within the sequence and the value "4" flags datapoints that need to be ignored. (Note: the flag values don't necessarily have to be 1, 2, 3, 4.)
What I want to achieve is a continuous ID value (written in a separate column, see 'desired_Id_Output' in the example below) that uniquely labels the slices of sequences from 2 to 3 (the length of a sequence is variable, ranging from 2 [start + end only] to 5000+ datapoints), so that I can do further groupby calculations on the individual sequences.
index sequence desired_Id_Output
0 2 1
1 1 1
2 1 1
3 1 1
4 1 1
5 3 1
6 2 2
7 1 2
8 1 2
9 3 2
10 4 NaN
11 4 NaN
12 2 3
13 3 3
Thanks in advance and BR!
I can't think of anything better than the "dumb" solution of looping through the entire thing, something like this:
import numpy as np
counter = 0
# np.float was removed in NumPy 1.24; use np.float64 instead
tmp = np.empty_like(df['sequence'].values, dtype=np.float64)
for i in range(len(tmp)):
    if df['sequence'][i] == 4:
        tmp[i] = np.nan
    else:
        if df['sequence'][i] == 2:
            counter += 1
        tmp[i] = counter
df['desired_Id_Output'] = tmp
Of course this is going to be pretty slow with a 20M-sized DataFrame. One way to improve this is via just-in-time compilation with numba:
import numba
@numba.njit
def foo(sequence):
    # put in appropriate modification of the above code block
    return tmp
and call this with argument df['sequence'].values.
Does it work to count the sequence starts, and then just set the ignore values (flag 4) afterwards? Like this:
import numpy as np
sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
# cast to float first so that NaN can be stored
sequence_id = sequence_starts.cumsum().astype(float)
sequence_id[sequence_ignore] = np.nan
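As a rough check of this idea against the example table from the question, the sketch below builds the IDs and then groups by them. It assumes that flag 4 only ever appears between sequences, never inside one; the column name seq_id is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'sequence': [2, 1, 1, 1, 1, 3, 2, 1, 1, 3, 4, 4, 2, 3]})

# Count the sequence starts, then blank out the rows flagged 4.
# Series.where keeps values where the condition holds and puts NaN elsewhere.
df['seq_id'] = (df['sequence'] == 2).cumsum().where(df['sequence'] != 4)

# groupby drops NaN keys by default, so the ignored rows fall out on their own.
lengths = df.groupby('seq_id').size()
print(lengths)
```

On this data the three sequences have lengths 6, 4 and 2, and the two rows flagged 4 end up with NaN.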
I have a pandas series of value_counts for a data set. I would like to plot the data with a color band (I'm using bokeh, but calculating the data band is the important part):
I hesitate to use the word standard deviation since all the references I use calculate that based on the mean value, and I specifically want to use the mode as the center.
So, basically, I'm looking for a way in pandas to start at the mode and return a new series of value counts that includes 68.2% of the sum of the value_counts. If I had this series:
val count
1 0
2 0
3 3
4 1
5 2
6 5 <-- mode
7 4
8 3
9 2
10 1
total = sum(count) # example value 21
band1_count = 21 * 0.682 # example value ~ 14.3
This is the order in which they would be added by an algorithm that walks the value counts on each side of the mode and includes the higher of the two until the sum of the counts is greater than 14.3.
band1_values = [6, 7, 8, 5, 9]
Here are the steps:
val count step
1 0
2 0
3 3
4 1
5 2 <-- 4) add to list -- eq (9,2), closer to (6,5)
6 5 <-- 1) add to list -- mode
7 4 <-- 2) add to list -- gt (5,2)
8 3 <-- 3) add to list -- gt (5,2)
9 2 <-- 5) add to list -- gt (4,1), stop since sum of counts > 14.3
10 1
Is there a native way to do this calculation in pandas or numpy? If there is a formal name for this study, I would appreciate knowing what it's called.
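The walk described in the steps above can be sketched in plain Python on top of a pandas Series. This is not a built-in pandas or numpy routine. The function name mode_band is made up, "closer to the mode" is measured in index positions, and a tie at equal distance goes to the left side; all of these are assumptions.

```python
import pandas as pd

def mode_band(counts, fraction=0.682):
    """Walk outwards from the mode, always taking the larger neighbouring
    count, until the running sum exceeds fraction * total."""
    counts = counts.sort_index()
    values = counts.index.to_list()
    c = counts.to_list()
    target = fraction * sum(c)
    mode_pos = c.index(max(c))          # first position holding the maximum
    chosen = [values[mode_pos]]
    running = c[mode_pos]
    left, right = mode_pos - 1, mode_pos + 1
    while running <= target and (left >= 0 or right < len(c)):
        # Take the left side if the right is exhausted, if the left count is
        # larger, or if the counts tie and the left is at least as close.
        take_left = right >= len(c) or (
            left >= 0 and (
                c[left] > c[right]
                or (c[left] == c[right] and mode_pos - left <= right - mode_pos)
            )
        )
        pos = left if take_left else right
        chosen.append(values[pos])
        running += c[pos]
        if take_left:
            left -= 1
        else:
            right += 1
    return chosen

counts = pd.Series([0, 0, 3, 1, 2, 5, 4, 3, 2, 1], index=range(1, 11))
band1_values = mode_band(counts)
print(band1_values)
```

For the example series this returns [6, 7, 8, 5, 9], the same order as in the steps listed in the question.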