I have a pandas series of value_counts for a data set. I would like to plot the data with a color band (I'm using bokeh, but calculating the data band is the important part):
I hesitate to use the word standard deviation since all the references I use calculate that based on the mean value, and I specifically want to use the mode as the center.
So, basically, I'm looking for a way in pandas to start at the mode and return a new series of value counts that includes 68.2% of the sum of the value_counts. If I had this series:
val count
1 0
2 0
3 3
4 1
5 2
6 5 <-- mode
7 4
8 3
9 2
10 1
total = sum(count) # example value 21
band1_count = 21 * 0.682 # example value ~ 14.3
This is the order in which they would be added, based on an algorithm that walks the value counts on each side of the mode and includes the higher of the two until the sum of the counts is greater than 14.3.
band1_values = [6, 7, 8, 5, 9]
Here are the steps:
val count step
1 0
2 0
3 3
4 1
5 2 <-- 4) add to list -- eq (9,2), closer to (6,5)
6 5 <-- 1) add to list -- mode
7 4 <-- 2) add to list -- gt (5,2)
8 3 <-- 3) add to list -- gt (5,2)
9 2 <-- 5) add to list -- gt (4,1), stop since sum of counts > 14.3
10 1
Is there a native way to do this calculation in pandas or numpy? If there is a formal name for this study, I would appreciate knowing what it's called.
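For reference, here is a minimal sketch of the walk described above (mode_band is a hypothetical helper name; the tie-break of preferring the side closer to the mode is inferred from step 4):

import pandas as pd

def mode_band(counts, frac=0.682):
    """Walk outward from the mode, adding the higher neighbouring count
    (ties go to the side closer to the mode) until the running sum
    exceeds frac of the total."""
    counts = counts.sort_index()
    target = counts.sum() * frac
    pos = int(counts.values.argmax())   # position of the mode
    lo = hi = pos
    band = [counts.index[pos]]
    running = counts.iloc[pos]
    while running <= target and (lo > 0 or hi < len(counts) - 1):
        left = counts.iloc[lo - 1] if lo > 0 else -1
        right = counts.iloc[hi + 1] if hi < len(counts) - 1 else -1
        # prefer the higher count; on a tie, the candidate closer to the mode
        take_right = right > left or (right == left and (hi + 1 - pos) < (pos - (lo - 1)))
        if take_right:
            hi += 1
            band.append(counts.index[hi])
            running += counts.iloc[hi]
        else:
            lo -= 1
            band.append(counts.index[lo])
            running += counts.iloc[lo]
    return band

counts = pd.Series([0, 0, 3, 1, 2, 5, 4, 3, 2, 1], index=range(1, 11))
mode_band(counts)   # -> [6, 7, 8, 5, 9]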
Related
I am trying to find the point at the end of the tail where a kernel density estimate becomes almost 0. My approach is to evaluate the kernel function over a timeline from -120 to 120 and compute the percentage change of those values. Then I can declare an arbitrary rule: after 10 consecutive negative changes, the kernel value that is almost 0 marks the start of the end of the curve.
Illustrative example of the point on the kernel function I want to obtain (image omitted):
In this case, the final value I would like to obtain is around 300.
My dataframe looks like this (these are not the same example values as above):
df
id event_time
1 2
1 3
1 3
1 5
1 9
1 10
2 1
2 1
2 2
2 2
2 5
2 5
# my try
import numpy as np
from scipy import stats

def find_value(df):
    if df.shape[0] == 1:
        return df.iloc[0].event_time
    kernel = stats.gaussian_kde(df['event_time'])
    time = list(range(-120, 120))
    a = kernel(time)                 # density evaluated on the timeline (Y axis)
    b = np.diff(a) / a[:-1] * 100    # percentage change between consecutive values
So far I have a, which represents the Y axis of the graph, and b, which represents the change in Y. I did this to implement the logic described at the beginning, but I don't know how to code it. After writing the function, I was thinking of using a groupby and an apply.
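A minimal sketch of that logic, under assumptions of my own (the helper name find_tail_start and the near-zero threshold eps are arbitrary choices; the rule of 10 consecutive negative changes comes from the description above):

import numpy as np
from scipy import stats

def find_tail_start(event_times, n_consecutive=10, eps=1e-4):
    """Return the first timeline point after n_consecutive negative
    percentage changes where the density is almost 0, or None."""
    kernel = stats.gaussian_kde(event_times)
    time = np.arange(-120, 120)
    a = kernel(time)                # Y axis of the graph
    b = np.diff(a) / a[:-1] * 100   # percentage change in Y
    negative = b < 0
    for i in range(len(negative) - n_consecutive + 1):
        window_end = i + n_consecutive
        if negative[i:window_end].all() and a[window_end] < eps:
            return time[window_end]
    return None

# applied per id, falling back to the raw value for single-row groups
result = df.groupby('id')['event_time'].apply(
    lambda s: s.iloc[0] if len(s) == 1 else find_tail_start(s.values))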
I have a problem calculating variance with "hidden" NULL (zero) values. Usually that shouldn't be a problem, because a NULL value is not a value, but in my case it is essential to include those NULLs as zeros in the variance calculation. So I have a DataFrame that looks like this:
TableA:
A X Y
1 1 30
1 2 20
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
Then I need to get the variance for each different X value, which I do like this:
TableA.groupby(['X']).agg({'Y':'var'})
But the answer is not what I need, since the variance calculation should also include a zero Y value for X=3 when A=1 and A=3.
What my dataset should look like to get the needed variance results:
A X Y
1 1 30
1 2 20
1 3 0
2 1 15
2 2 20
2 3 20
3 1 30
3 2 35
3 3 0
So I need the variance to take into account that every A should have X values 1, 2 and 3, and when there is no Y value for a certain X, it should be counted as 0. Could you help me with this? How should I change my TableA dataframe to be able to do this, or is there another way?
Desired output for TableA should be like this:
X Y
1 75.000000
2 75.000000
3 133.333333
Compute the variance directly, but divide by the number of different possibilities for A:
import numpy as np

# three in your example; adjust as needed
a_choices = len(TableA['A'].unique())

def variance_with_missing(vals):
    # mean over all A slots, counting the missing ones as 0
    mean_with_missing = np.sum(vals) / a_choices
    # sum of squares for the values that are present
    ss_present = np.sum((vals - mean_with_missing)**2)
    # each missing value contributes (0 - mean)**2
    ss_missing = (a_choices - len(vals)) * mean_with_missing**2
    return (ss_present + ss_missing) / (a_choices - 1)

TableA.groupby(['X']).agg({'Y': variance_with_missing})
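On the sample data this reproduces the desired output above: 75.0 for X=1 and X=2, and 133.333333 for X=3, since each missing (A, X) pair contributes a zero to the mean and a mean_with_missing**2 term to the sum of squares.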
The approach of the solution below is to append the missing combinations with Y=0. It's a little messy, but I hope this helps.
import numpy as np
import pandas as pd

TableA = pd.DataFrame({'A': [1, 1, 2, 2, 2, 3, 3],
                       'X': [1, 2, 1, 2, 3, 1, 2],
                       'Y': [30, 20, 15, 20, 20, 30, 35]})
TableA['A'] = TableA['A'].astype(int)

#### Create rows for the non-existing combinations and fill Y with 0 ####
for i in range(1, TableA.X.max() + 1):
    for j in TableA.A.unique():
        if TableA[(TableA.X == i) & (TableA.A == j)].empty:
            TableA = pd.concat([TableA, pd.DataFrame({'A': [j], 'X': [i], 'Y': [0]})],
                               ignore_index=True)

TableA.groupby('X').agg({'Y': np.var})
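A loop-free alternative sketch (not part of the answer above) is to build the full (A, X) grid with a MultiIndex, fill the gaps with 0, and then take the plain variance:

import numpy as np
import pandas as pd

# full (A, X) grid; missing combinations get Y=0
full_index = pd.MultiIndex.from_product(
    [TableA['A'].unique(), range(1, TableA['X'].max() + 1)],
    names=['A', 'X'])
filled = (TableA.set_index(['A', 'X'])
                .reindex(full_index, fill_value=0)
                .reset_index())
filled.groupby('X').agg({'Y': np.var})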
I have a very big dataframe (20,000,000+ rows) that contains a column called 'sequence', amongst others.
The 'sequence' column is calculated from a time series by applying a few conditional statements. The value "2" flags the start of a sequence, the value "3" flags the end of a sequence, the value "1" flags a datapoint within the sequence, and the value "4" flags datapoints that need to be ignored. (Note: the flag values don't necessarily have to be 1, 2, 3, 4.)
What I want to achieve is a continuous ID value (written in a separate column; see 'desired_Id_Output' in the example below) that labels the slices of sequences from 2 to 3 in a unique fashion (the length of a sequence is variable, ranging from 2 [start + end only] to 5000+ datapoints), to be able to do further groupby calculations on the individual sequences.
index sequence desired_Id_Output
0 2 1
1 1 1
2 1 1
3 1 1
4 1 1
5 3 1
6 2 2
7 1 2
8 1 2
9 3 2
10 4 NaN
11 4 NaN
12 2 3
13 3 3
Thanks in advance and BR!
I can't think of anything better than the "dumb" solution of looping through the entire thing, something like this:
import numpy as np

counter = 0
tmp = np.empty_like(df['sequence'].values, dtype=float)
for i in range(len(tmp)):
    if df['sequence'][i] == 4:
        tmp[i] = np.nan
    else:
        if df['sequence'][i] == 2:
            counter += 1
        tmp[i] = counter
df['desired_Id_output'] = tmp
Of course this is going to be pretty slow with a 20M-sized DataFrame. One way to improve this is via just-in-time compilation with numba:
import numba

@numba.njit
def foo(sequence):
    # put in an appropriate modification of the above code block
    return tmp

and call this with the argument df['sequence'].values.
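For completeness, a filled-in sketch of that numba version might look like this (label_sequences is a hypothetical name; the body mirrors the loop above):

import numba
import numpy as np

@numba.njit
def label_sequences(sequence):
    tmp = np.empty(sequence.shape[0], dtype=np.float64)
    counter = 0
    for i in range(sequence.shape[0]):
        if sequence[i] == 4:
            tmp[i] = np.nan       # ignored datapoints get NaN
        else:
            if sequence[i] == 2:  # a new sequence starts here
                counter += 1
            tmp[i] = counter
    return tmp

df['desired_Id_output'] = label_sequences(df['sequence'].values)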
Does it work to count the sequence starts, and then just set the ignore values (flag 4) afterwards? Like this:
import numpy as np

sequence_starts = df.sequence == 2
sequence_ignore = df.sequence == 4
sequence_id = sequence_starts.cumsum().astype(float)
sequence_id[sequence_ignore] = np.nan
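On the example above this yields exactly the desired_Id_Output column: the cumulative sum of the boolean start markers increments at every 2, and the rows flagged 4 are overwritten with NaN afterwards.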
I have a matrix as shown below (read from a txt file passed as an argument), and every cell has neighbors. Once you pick a cell, that cell and all neighboring cells containing the same number disappear.
1 0 4 7 6 8
0 5 4 4 5 5
2 1 4 4 4 6
4 1 3 7 4 4
I've tried to do this using recursion. I split the function into four parts: up(), down(), left() and right(). But I got an error message: RecursionError: maximum recursion depth exceeded in comparison.
# lines: the matrix read from the txt file
cmd = input("Row,column:")
cmdlist = cmd.split(",")
row, column = int(cmdlist[0]), int(cmdlist[1])
num = lines[row-1][column-1]

def up(x, y):
    if lines[x-2][y-1] == num and x > 1:
        left(x, y)
        right(x, y)
        lines[x-2][y-1] = None

def left(x, y):
    if lines[x-1][y-2] == num and y > 1:
        up(x, y)
        down(x, y)
        lines[x-1][y-2] = None

def right(x, y):
    if lines[x-1][y] == num and y < len(lines[row-1]):
        up(x, y)
        down(x, y)
        lines[x-1][y] = None

def down(x, y):
    if lines[x][y-1] == num and x < len(lines):
        left(x, y)
        right(x, y)
        lines[x][y-1] = None

up(row, column)
down(row, column)
for i in lines:
    print(str(i).strip("[]").replace(",", "").replace("None", " "))
When I give the input (3,3), which refers to a cell containing "4", the output must be like this:
1 0 7 6 8
0 5 5 5
2 1 6
4 1 3 7
I don't need fixed code, just the main idea will be enough. Thanks a lot.
A recursion error happens when your recursion does not terminate.
You can solve this without recursion, using sets of indices:
collect all indices that contain the looked-for number into all_num_idx
add the index you are currently at (your input) to a set tbd (to be deleted)
loop over tbd and add all indices from all_num_idx that differ by only +1/-1 in row or column from any index that's already in the set
repeat until tbd no longer grows
delete all indices in tbd:
text = """4 0 4 7 6 8
0 5 4 4 5 5
2 1 4 4 4 6
4 1 3 7 4 4"""

data = [k.strip().split() for k in text.splitlines()]

row, column = map(int, input("Row,column:").strip().split(","))

num = data[row][column]
len_r = len(data)
len_c = len(data[0])

all_num_idx = set((r, c) for r in range(len_r) for c in range(len_c) if data[r][c] == num)

tbd = set([(row, column)])   # initial field
tbd_size = 0                 # different size to enter the while loop
done = set()                 # we processed those already

while len(tbd) != tbd_size:  # loop while growing
    tbd_size = len(tbd)
    for t in tbd:
        if t in done:
            continue
        # only 4-piece neighbourhood: +1 or -1 in one direction
        poss_neighbours = set([(t[0]+1, t[1]), (t[0], t[1]+1),
                               (t[0]-1, t[1]), (t[0], t[1]-1)])
        # 8-way neighbourhood with diagonals:
        # poss_neighbours = set((t[0]+a, t[1]+b) for a in range(-1,2) for b in range(-1,2))
        tbd = tbd.union(poss_neighbours & all_num_idx)
        # reduce all_num_idx by all those that we already added
        all_num_idx -= tbd
        done.add(t)

# delete the indexes we collected
for r, c in tbd:
    data[r][c] = None

# output
for line in data:
    print(*(c or " " for c in line), sep=" ")
Output:
Row,column: 3,4
4 0 7 6 8
0 5 5 5
2 1 6
4 1 3 7
This is a variant of a "flood fill" algorithm, flooding only cells of a certain value. See https://en.wikipedia.org/wiki/Flood_fill
Maybe you should replace
def right(x, y):
    if lines[x-1][y] == num and y < len(lines[row-1]):
        up(x, y)
        down(x, y)
        lines[x-1][y] = None
by
def right(x, y):
    if lines[x-1][y] == num and y < len(lines[row-1]):
        lines[x-1][y] = None
        up(x, y + 1)
        down(x, y + 1)
        right(x, y + 1)
and do the same for all the other functions.
Setting lines[x-1][y] = None first ensures that your algorithm stops, and changing the indices ensures that the next step of your algorithm starts from the neighbouring cell.
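To make the main idea concrete, here is a sketch of a single recursive flood function (using 0-based indices and the same lines, num, row and column variables as in the question) that could replace the four directional functions:

def flood(r, c):
    # stop at the matrix borders
    if not (0 <= r < len(lines) and 0 <= c < len(lines[r])):
        return
    # stop at cells that don't hold the picked number (or were cleared already)
    if lines[r][c] != num:
        return
    lines[r][c] = None   # clearing before recursing guarantees termination
    flood(r - 1, c)
    flood(r + 1, c)
    flood(r, c - 1)
    flood(r, c + 1)

flood(row - 1, column - 1)

Each matching cell is cleared before the recursive calls, so no cell is visited twice and the recursion terminates.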
I am new to Python and am currently facing an issue I can't solve. I really hope you can help me out. English is not my native language, so I am sorry if I can't express myself properly.
Say I have a simple data frame with two columns:
index Num_Albums Num_authors
0 10 4
1 1 5
2 4 4
3 7 1000
4 1 44
5 3 8
Num_Albums_tot = sum(Num_Albums) = 30
I need to do a cumulative sum of the data in Num_Albums until a certain condition is reached, register the index at which the condition is achieved, and get the corresponding value from Num_authors.
Example:
cumulative sum of Num_Albums until the sum equals 50% ± 1/15 of 30 (i.e. 15±2):
10 = 15±2? No, then continue;
10+1 = 15±2? No, then continue;
10+1+4 = 15±2? Yes, stop.
The condition is reached at index 2. Then get Num_authors at that index: Num_authors(2) = 4.
I would like to see if there's a function already implemented in pandas before I start thinking about how to do it with a while/for loop.
[I would also like to be able to specify the column from which to retrieve the value at the relevant index. This comes in handy when I have, e.g., 4 columns and I want to sum the elements in column 1; once the condition is reached, get the corresponding value in column 2, then do the same with columns 3 and 4.]
Opt - 1:
You could compute the cumulative sum using cumsum. Then use np.isclose with its built-in tolerance parameter to check whether the values in this series lie within the specified threshold of 15 +/- 2. This returns a boolean array.
Through np.flatnonzero, return the ordinal indices for which the True condition holds, and select the first instance of a True value.
Finally, use .iloc to retrieve the value of the column you require based on the index computed earlier:
val = np.flatnonzero(np.isclose(df.Num_Albums.cumsum().values, 15, atol=2))[0]
df['Num_authors'].iloc[val]   # for faster access, use .iat
4
Running np.isclose on the cumsum series converted to an array gives:
np.isclose(df.Num_Albums.cumsum().values, 15, atol=2)
array([False, False, True, False, False, False], dtype=bool)
Opt - 2:
Use pd.Index.get_loc on the computed cumsum series, which also supports a tolerance parameter for the nearest method:
val = pd.Index(df.Num_Albums.cumsum()).get_loc(15, 'nearest', tolerance=2)
df['Num_authors'].iat[val]   # df.get_value is deprecated
4
Opt - 3:
Use idxmax to find the first index of a True value in the boolean mask created by the sub and abs operations on the cumsum series:
df.loc[df.Num_Albums.cumsum().sub(15).abs().le(2).idxmax(), 'Num_authors']
4
I think you can directly add a column with the cumulative sum as:
In [3]: df
Out[3]:
index Num_Albums Num_authors
0 0 10 4
1 1 1 5
2 2 4 4
3 3 7 1000
4 4 1 44
5 5 3 8
In [4]: df['cumsum'] = df['Num_Albums'].cumsum()
In [5]: df
Out[5]:
index Num_Albums Num_authors cumsum
0 0 10 4 10
1 1 1 5 11
2 2 4 4 15
3 3 7 1000 22
4 4 1 44 23
5 5 3 8 26
And then apply the condition you want to the cumsum column. For instance, you can use where to get the full row matching the filter, setting the tolerance tol:
In [18]: tol = 2
In [19]: cond = df.where((df['cumsum']>=15-tol)&(df['cumsum']<=15+tol)).dropna()
In [20]: cond
Out[20]:
index Num_Albums Num_authors cumsum
2 2.0 4.0 4.0 15.0
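From cond you can then pick the column you need, e.g.:
In [21]: cond['Num_authors'].iloc[0]
Out[21]: 4.0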
This could even be done with the following code:
def your_function(df):
    total = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        total += i
        index += 1
        if total == "your_condition":   # replace with your stopping condition
            return (index, df.loc[df.Num_Albums == i, 'Num_authors'])
This would return a tuple of the index and the corresponding value of Num_authors as soon as "your_condition" is reached.
or the index could even be returned as an array by:
def your_function(df):
    total = 0
    index = -1
    for i in df['Num_Albums'].tolist():
        total += i
        index += 1
        if total == "your_condition":   # replace with your stopping condition
            return df.loc[df.Num_Albums == i, 'Num_authors'].index.values
I was not able to figure out the exact condition for when to stop the cumulative sum, so I left it as "your_condition" in the code. I am also new, so I hope this helps!