I have a pandas dataframe like so:
id val
1 10
1 20
2 19
2 21
2 15
Now I want to group by id and calculate a weight column as 1 / (number of rows in each group). The final dataframe should look like:
id val weight
1 10 0.5
1 20 0.5
2 19 0.33
2 21 0.33
2 15 0.33
What's the easiest way to achieve this?
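For reference, the example frame can be rebuilt like this (a minimal sketch based on the sample data above):
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                   'val': [10, 20, 19, 21, 15]})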
Use GroupBy.transform with division:
df['weight'] = 1 / df.groupby('id')['id'].transform('size')
#alternative
#df['weight'] = df.groupby('id')['id'].transform('size').rdiv(1)
Or Series.map with Series.value_counts:
df['weight'] = 1 / df['id'].map(df['id'].value_counts())
#alternative
#df['weight'] = df['id'].map(df['id'].value_counts()).rdiv(1)
print (df)
id val weight
0 1 10 0.500000
1 1 20 0.500000
2 2 19 0.333333
3 2 21 0.333333
4 2 15 0.333333
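If you want the two-decimal figures from the expected output (0.33 instead of 0.333333), round the result afterwards:
df['weight'] = df['weight'].round(2)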
Related
My end goal is to sum the minutes only from initial to final in the periods column. This needs to be grouped by id.
I have thousands of ids, and not all of them have the same number of rows between initial and final.
Periods are sorted in a "journey" fashion; each record represents a period of time for its id.
Pseudocode:
iterate rows per id and sum the values in column "min",
but only where the run of periods starts at "initial" and ends at "final"
Example with 2 ids
id  periods     min
1   period_x     10
1   initial       2
1   progress      3
1   progress_1    4
1   final         5
2   period_y     10
2   period_z      2
2   initial       3
2   progress_1   20
2   final         3
Desired output
id  periods     min  sum
1   period_x     10   14
1   initial       2   14
1   progress      3   14
1   progress_1    4   14
1   final         5   14
2   period_y     10   26
2   period_z      2   26
2   initial       3   26
2   progress_1   20   26
2   final         3   26
So far I've tried:
L = ['initial', 'final']
df['sum'] = df.id.where(df.zone_name.isin(L)).groupby(df['if']).transform('sum')
But this doesn't count what is in between initial and final.
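For reference, the two-id example can be rebuilt like this (a minimal sketch; column names taken from the table above):
import pandas as pd
df = pd.DataFrame({
    'id':      [1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
    'periods': ['period_x', 'initial', 'progress', 'progress_1', 'final',
                'period_y', 'period_z', 'initial', 'progress_1', 'final'],
    'min':     [10, 2, 3, 4, 5, 10, 2, 3, 20, 3],
})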
Create groups using cumsum, compute the sum of group 1, and then broadcast that sum to the entire column. "Group 1" is everything per id that lies between initial and final (inclusive):
import numpy as np
# Flag the boundary rows, then cumulative-sum the flags per id:
# every row from 'initial' through 'final' ends up in group 1.
df['grp'] = df['periods'].isin(['initial', 'final'])
df['grp'] = np.where(df['periods'] == 'final', 1, df.groupby('id')['grp'].cumsum())
# Sum 'min' only inside group 1, then broadcast that sum to every row of the id.
df['sum'] = np.where(df['grp'].eq(1), df.groupby(['id', 'grp'])['min'].transform('sum'), np.nan)
df['sum'] = df.groupby('id')['sum'].transform('max')
df
Out[1]:
id periods min grp sum
0 1 period_x 10 0 14.0
1 1 initial 2 1 14.0
2 1 progress 3 1 14.0
3 1 progress_1 4 1 14.0
4 1 final 5 1 14.0
5 2 period_y 10 0 26.0
6 2 period_z 2 0 26.0
7 2 initial 3 1 26.0
8 2 progress_1 20 1 26.0
9 2 final 3 1 26.0
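If the helper column is not needed in the final result, it can be dropped afterwards (and the sum cast back to int, if preferred):
df = df.drop(columns='grp')
df['sum'] = df['sum'].astype(int)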
I want to write code that outputs the number of repetitions of each distinct value in the array a, and then put the result into a pandas DataFrame to print it. The sums code below does not work; how can I make it work and get the expected output?
import numpy as np
import pandas as pd
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
uniques = np.unique(a)
sums = np.sum(uniques[:-1]==a[:-1])
Expected Output:
Value Repetition Count
1 1
3 3
12 3
22 1
43 4
Define a dataframe df based on the array a. Then, use .groupby() + .size() to get the size/count of unique values, as follows:
a = np.array([12,12,12,3,43,43,43,22,1,3,3,43])
df = pd.DataFrame({'Value': a})
df.groupby('Value').size().reset_index(name='Repetition Count')
Result:
Value Repetition Count
0 1 1
1 3 3
2 12 3
3 22 1
4 43 4
Edit
If you also want the percentages of the counts, you can use:
(df.groupby('Value', as_index=False)
.agg(**{'Repetition Count': ('Value', 'size'),
'Percent': ('Value', lambda x: round(x.size/len(a) *100, 2))})
)
Result:
Value Repetition Count Percent
0 1 1 8.33
1 3 3 25.00
2 12 3 25.00
3 22 1 8.33
4 43 4 33.33
or use .value_counts with normalize=True
pd.Series(a).value_counts(normalize=True).mul(100)
Result:
43 33.333333
12 25.000000
3 25.000000
22 8.333333
1 8.333333
dtype: float64
You can use groupby:
>>> pd.Series(a).groupby(a).count()
1 1
3 3
12 3
22 1
43 4
dtype: int64
Or value_counts():
>>> pd.Series(a).value_counts().sort_index()
1 1
3 3
12 3
22 1
43 4
dtype: int64
It's easiest if you make a pandas DataFrame from the NumPy array and then use value_counts().
df = pd.DataFrame(data=a, columns=['col1'])
print(df.col1.value_counts())
43 4
12 3
3 3
22 1
1 1
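Note that value_counts() orders by count rather than by value; to match the ordering of the expected output, sort by the index:
print(df.col1.value_counts().sort_index())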
How can I find the first element of each session (per group), i.e. the element that starts a new run of continuous values?
import pandas as pd
df = pd.DataFrame({
    'group': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
              2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    'value': [1, 2, 3, 4, 5, 10, 11, 15, 16, 17, 18, 19, 20,
              21, 22, 23, 24, 26, 27.28,
              4, 5, 6, 8, 9, 10, 11, 12, 13, 14],
})
display(df)
So far I am stuck here:
df['shifted_value'] = df['value'].shift(-1)
df['difference_next'] = df['shifted_value'] - df['value']
# this is obviously not yet correct - how can I get the first element (element 0 of each of the starting sessions)?
df['session_element_index'] = df.groupby(['group']).cumcount()
df.head()
In SQL I would use a window function and compare previous/next elements to determine if a session starts/ends. Is there a nicer, more pandas-native way to do this in a vectorized fashion?
Use DataFrameGroupBy.diff, compare with not-equal-to 1, and filter with boolean indexing:
df1 = df[df.groupby('group')['value'].diff().ne(1)]
print (df1)
group value
0 1 1.00
5 1 10.00
7 1 15.00
17 1 26.00
18 1 27.28
19 2 4.00
22 2 8.00
If you need a counter column:
g = df.groupby('group')['value'].apply(lambda x: x.diff().ne(1).cumsum())
# count within each (group, session) pair so equal session numbers from different groups don't collide
df['session_element_index'] = df.groupby([df['group'], g]).cumcount()
print(df.head(10))
group value session_element_index
0 1 1.0 0
1 1 2.0 1
2 1 3.0 2
3 1 4.0 3
4 1 5.0 4
5 1 10.0 0
6 1 11.0 1
7 1 15.0 0
8 1 16.0 1
9 1 17.0 2
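The first element of every session is then simply the rows where the counter is 0, which selects the same rows as df1 above:
print(df[df['session_element_index'].eq(0)])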
As a first approach:
out = (df.groupby("group", as_index=False)["value"]
         .apply(lambda s: ((s - s.shift()) != 1.0).cumsum()
                           .drop_duplicates()))
>>> out
0 0 1
5 2
7 3
17 4
18 5
1 19 1
22 2
Name: value, dtype: int64
>>> out.index.get_level_values(1)
Int64Index([0, 5, 7, 17, 18, 19, 22], dtype='int64')
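Those values in the second index level are the row labels of the session starts, so the corresponding rows can be pulled straight from df:
print(df.loc[out.index.get_level_values(1)])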
There is a piece of software that exports a table like the following example:
import pandas as pd
s0 = ',,,,Cond1,,,,Cond2,,'.split(',')
s1 = 'Gene name,Description,Anova,FoldChange,Sample1,Sample2,Sample3,Sample4,Sample5,Sample6,Sample7'.split(',')
s2 = 'HK1,Hexokinase,0.05,1.00,1.5,1.0,0.5,1.0,0,0,0'.split(',')
df = pd.DataFrame(data=(s0, s1, s2))
0 1 2 3 4 5 6 \
0 Cond1
1 Gene name Description Anova FoldChange Sample1 Sample2 Sample3
2 HK1 Hexokinase 0.05 1.00 1.5 1.0 0.5
7 8 9 10
0 Cond2
1 Sample4 Sample5 Sample6 Sample7
2 1.0 0 0 0
However, the organization of this table is not straightforward and, therefore, it is hard to analyze the conditions.
I would like to produce a data frame in which each sample is matched with its respective condition.
It should be something like the output of the following code:
import pandas as pd
s1 = 'Gene name,Description,Anova,FoldChange,Sample1.Cond1,Sample2.Cond1,Sample3.Cond1,Sample4.Cond1,Sample5.Cond2,Sample6.Cond2,Sample7.Cond2'.split(',')
s2 = 'HK1,Hexokinase,0.05,1.00,1.5,1.0,0.5,1.0,0,0,0'.split(',')
df = pd.DataFrame(data=(s1, s2))
0 1 2 3 4 5 \
0 Gene name Description Anova FoldChange Sample1.Cond1 Sample2.Cond1
1 HK1 Hexokinase 0.05 1.00 1.5 1.0
6 7 8 9 10
0 Sample3.Cond1 Sample4.Cond1 Sample5.Cond2 Sample6.Cond2 Sample7.Cond2
1 0.5 1.0 0 0 0
Turn the empty cells in row 0 into NaN with Series.where, then fill them with Series.ffill.
Finally, you can use Series.str.cat to join both rows:
df.iloc[0] = (df.iloc[1]
              .str.cat(df.iloc[0].where(df.iloc[0].notnull() & df.iloc[0].ne('')).ffill(), sep='.')
              .fillna(df.iloc[1]))
df = df.drop(1).reset_index(drop=True)
print(df)
Output:
0 1 2 3 4 5 \
0 Gene name Description Anova FoldChange Sample1.Cond1 Sample2.Cond1
1 HK1 Hexokinase 0.05 1.00 1.5 1.0
6 7 8 9 10
0 Sample3.Cond1 Sample4.Cond1 Sample5.Cond2 Sample6.Cond2 Sample7.Cond2
1 0.5 1.0 0 0 0
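If you then want the combined labels as the actual column names rather than as row 0, you can promote that row to the header (a small follow-up sketch):
df.columns = df.iloc[0]
df = df.drop(0).reset_index(drop=True)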
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is find the rate of change of data_col using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row and the rate_of_change is calculated separately for different IDs.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
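For reference, the example frame can be rebuilt like this (a minimal sketch based on the sample shown):
import pandas as pd
df = pd.DataFrame({'ID_col': [1, 1, 1, 2, 2, 3],
                   'time_in_hours': [62.5, 40, 20, 30, 20, 50],
                   'data_col': [4, 3, 3, 1, 5, 6]})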
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
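Since the index matches the original frame after dropping the group level, the result can be assigned straight back as the new column:
df['rate_of_change'] = s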
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff:
df.groupby('ID_col').apply(
lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64