I am trying to do something like this:
ff = pd.DataFrame({'uid':[1,1,1,20,20,20,4,4,4],
'date':['09/06','10/06','11/06',
'09/06','10/06','11/06',
'09/06','10/06','11/06'],
'balance':[150,200,230,12,15,15,700,1000,1500],
'difference':[np.nan,50,30,np.nan,3,0,np.nan,300,500]})
I have tried with rolling, but I cannot find the function or the rolling sub-class that subtracts, only sum and var and other stats.
Is there a way?
I was thinking that I can create two dfs: one - with the first row of every uid eliminated, the second one - with the last row of every uid eliminated. But to be honest, I have no idea how to do that dynamically, for every uid.
Use groupby with diff:
df = pd.DataFrame({'uid':[1,1,1,20,20,20,4,4,4],
'date':['09/06','10/06','11/06',
'09/06','10/06','11/06',
'09/06','10/06','11/06'],
'balance':[150,200,230,12,15,15,700,1000,1500]})
df['difference'] = df.groupby('uid')['balance'].diff()
Output:
uid date balance difference
0 1 09/06 150 NaN
1 1 10/06 200 50.0
2 1 11/06 230 30.0
3 20 09/06 12 NaN
4 20 10/06 15 3.0
5 20 11/06 15 0.0
6 4 09/06 700 NaN
7 4 10/06 1000 300.0
8 4 11/06 1500 500.0
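For intuition, diff() within each group is the same as subtracting the previous row of that group, which is roughly what the two-DataFrame idea in the question amounts to. A minimal sketch, assuming the same sample data (the names prev, via_shift and via_diff are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "uid": [1, 1, 1, 20, 20, 20, 4, 4, 4],
    "balance": [150, 200, 230, 12, 15, 15, 700, 1000, 1500],
})

# Previous balance within the same uid; the first row of each uid has none (NaN).
prev = df.groupby("uid")["balance"].shift(1)

# Subtracting the shifted column reproduces groupby(...).diff().
via_shift = df["balance"] - prev
via_diff = df.groupby("uid")["balance"].diff()

print(via_shift.equals(via_diff))
```

The shift form is handy when something other than a plain subtraction is needed, for example a ratio between consecutive rows.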
date data
0 2012/1/1 100
1 2012/1/2 109
2 2012/1/3 108
3 2012/1/4 120
4 2012/1/5 80
5 2012/1/6 130
6 2012/1/7 100
7 2012/1/8 140
Given the dataframe above, I want to count, for each row, the number of rows whose data value is within ±10 of that row's data value, and append that count to the row, such that:
date data Count
0 2012/1/1 100.0 4.0
1 2012/1/2 109.0 4.0
2 2012/1/3 108.0 4.0
3 2012/1/4 120.0 2.0
4 2012/1/5 80.0 1.0
5 2012/1/6 130.0 3.0
6 2012/1/7 100.0 4.0
7 2012/1/8 140.0 2.0
Since each row's value defines its own comparison range, I use iterrows, although I know this is not elegant:
result = pd.DataFrame(index=df.index)
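for i, r in df.iterrows():
    # bounds of the ±10 window around this row's value
    high = r['data'] + 10
    low = r['data'] - 10
    # rows whose value falls inside the window (the row itself included)
    df2 = df.loc[(df['data'] <= high) & (df['data'] >= low)]
    result.loc[i, 'date'] = r['date']
    result.loc[i, 'data'] = r['data']
    result.loc[i, 'count'] = df2.shape[0]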
result
Is there a more idiomatic pandas way to do that?
Thank you for any help!
Use NumPy broadcasting to build the boolean mask, then count the True values in each row with sum:
arr = df['data'].to_numpy()
df['count'] = ((arr[:, None] <= arr+10)&(arr[:, None] >= arr-10)).sum(axis=1)
print (df)
date data count
0 2012/1/1 100 4
1 2012/1/2 109 4
2 2012/1/3 108 4
3 2012/1/4 120 2
4 2012/1/5 80 1
5 2012/1/6 130 3
6 2012/1/7 100 4
7 2012/1/8 140 2
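Note that the broadcast mask is an n by n boolean array, so memory grows quadratically with the number of rows. For larger frames, a sorted lookup should give the same counts. The following is only a sketch that swaps broadcasting for np.searchsorted, assuming the same sample values (sorted_arr, left and right are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"data": [100, 109, 108, 120, 80, 130, 100, 140]})

arr = df["data"].to_numpy()
sorted_arr = np.sort(arr)

# Position of the first sorted value >= (value - 10) ...
left = np.searchsorted(sorted_arr, arr - 10, side="left")
# ... and the position just past the last sorted value <= (value + 10).
right = np.searchsorted(sorted_arr, arr + 10, side="right")

# The number of values inside each closed window is the gap between the two.
df["count"] = right - left
print(df)
```

This costs one sort plus two binary searches per row instead of n comparisons per row.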
I have a dataframe with the following structure:
1995 1996
AT1 3 6
AT2 5 3
AT3 2 1
FR1 1 1
FR5 2 1
FR7 7 8
I would like to add columns or create a dataframe containing the percentage of each row over the total, depending on the groups indicated by the first two letters.
Basically, for each column:
1. Sum the values of each group of rows (i.e., sum all the rows starting with AT, then all the rows starting with FR, and so on).
2. Divide each row by the sum of its group and multiply by 100.
3. Put these values in a new column or a new dataframe.
The expected output would be:
1995 1996 Percentage_1995 Percentage_1996
AT1 3 6 30 60
AT2 5 3 50 30
AT3 2 1 20 10
FR1 1 1 10 10
FR5 2 1 20 10
FR7 7 8 70 80
I know it may sound confusing so I apologize if I'm not very clear. I would appreciate any help you could provide. Thank you in advance.
You can use GroupBy.transform to get each group's total, df.div to divide by it, and df.mul to multiply by 100, then attach the result with df.assign:
temp = df.div(
    df.groupby(df.index.str[:2]).transform("sum")
).mul(100).add_prefix("Percentage")
df.assign(**temp)
1995 1996 Percentage1995 Percentage1996
AT1 3 6 30.0 60.0
AT2 5 3 50.0 30.0
AT3 2 1 20.0 10.0
FR1 1 1 10.0 10.0
FR5 2 1 20.0 10.0
FR7 7 8 70.0 80.0
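A self-contained sketch of the same steps, assuming the year columns are strings and using a "Percentage_" prefix so the new names match those in the question (group_totals, pct and result are illustrative names, and join is used here in place of assign):

```python
import pandas as pd

df = pd.DataFrame(
    {"1995": [3, 5, 2, 1, 2, 7], "1996": [6, 3, 1, 1, 1, 8]},
    index=["AT1", "AT2", "AT3", "FR1", "FR5", "FR7"],
)

# Group key: the first two letters of each index label ("AT", "FR").
group_totals = df.groupby(df.index.str[:2]).transform("sum")

# Share of each row within its group, as a percentage.
pct = df.div(group_totals).mul(100).add_prefix("Percentage_")

result = df.join(pct)
print(result.round(1))
```

Because these are floating-point divisions, a value such as 30 may print with a tiny rounding error unless it is rounded.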
I am trying to create a new column that will list down the last recorded peak values, until the next peak comes along. For example, suppose this is my existing DataFrame:
index values
0 10
1 20
2 15
3 17
4 15
5 22
6 20
I want to get something like this:
index values last_recorded_peak
0 10 10
1 20 20
2 15 20
3 17 17
4 15 17
5 22 22
6 20 22
So far, I have tried with np.maximum.accumulate, which 'accumulates' the max value but not quite the "peaks" (some peaks might be lower than the max value).
I have also tried with scipy.signal.find_peaks which returns an array of indexes where my peaks are (in the example, index 1, 3, 5), which is not what I'm looking for.
I'm relatively new to coding, any pointer is very much appreciated!
You're on the right track: scipy.signal.find_peaks is the way I would go. You just need to do a little more work on the result:
from scipy import signal
peaks = signal.find_peaks(df['values'])[0]
df['last_recorded_peak'] = (df.assign(last_recorded_peak=float('nan'))
                              .last_recorded_peak
                              .combine_first(df.loc[peaks,'values'])
                              .ffill()
                              .combine_first(df['values']))
print(df)
index values last_recorded_peak
0 0 10 10.0
1 1 20 20.0
2 2 15 20.0
3 3 17 17.0
4 4 15 17.0
5 5 22 22.0
6 6 20 22.0
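The same "mark the peaks, then carry them forward" idea can be sketched with pandas alone. This version is an assumption-laden stand-in, not the answer's method: it replaces find_peaks with a comparison against both neighbours, so it only detects strict local maxima and does not treat flat plateaus the way find_peaks does.

```python
import pandas as pd

df = pd.DataFrame({"values": [10, 20, 15, 17, 15, 22, 20]})
v = df["values"]

# Strict local maximum: larger than both neighbours (edge rows never qualify).
is_peak = (v > v.shift(1)) & (v > v.shift(-1))

# Keep values only at peaks, carry them forward, and fall back to the raw
# value for rows that come before the first peak.
df["last_recorded_peak"] = v.where(is_peak).ffill().fillna(v)
print(df)
```

For this sample the peaks fall at positions 1, 3 and 5, the same positions the question reports for find_peaks.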
If I understand you correctly, you are looking for a rolling max:
Note: you might have to play around with the window size, which I set to 2 for your example dataframe.
df['last_recorded_peak'] = df['values'].rolling(2).max().fillna(df['values'])
Output
values last_recorded_peak
0 10 10.0
1 20 20.0
2 15 20.0
3 17 17.0
4 15 17.0
5 22 22.0
6 20 22.0
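The window size matters because a rolling max only looks back a fixed number of rows. A small sketch with made-up data (not from the question) in which the values keep falling for longer than the window:

```python
import pandas as pd

# Hypothetical series: after the peak of 20, the values fall for three rows.
s = pd.Series([10, 20, 15, 14, 13, 22, 20])

rolled = s.rolling(2).max().fillna(s)
print(rolled.tolist())
# Positions 3 and 4 come out as 15 and 14, not the last peak of 20,
# because the peak has already left the two-row window.
```

So a window of 2 matches the expected output only when every peak is followed by at most one lower row, as in the example dataframe.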
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is, find the rate of change of data_col by using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and rate_of_change is calculated separately for each ID.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use Series.diff within each group:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64
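Since diff is available directly on a grouped column, the apply and the index clean-up can probably be skipped altogether. A minimal sketch, assuming the sample data from the question (grouped is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    "ID_col": [1, 1, 1, 2, 2, 3],
    "time_in_hours": [62.5, 40, 20, 30, 20, 50],
    "data_col": [4, 3, 3, 1, 5, 6],
})

grouped = df.groupby("ID_col")

# Both diffs restart at every ID, so the first row of each ID is NaN,
# and the result is already aligned with the original index.
df["rate_of_change"] = (
    grouped["data_col"].diff() / grouped["time_in_hours"].diff().abs()
)
print(df)
```

Unlike the shift-and-mask approach, this does not rely on rows of the same ID being adjacent. Two consecutive rows with identical time_in_hours would still divide by zero and produce inf or NaN.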
I have a table in a pandas DataFrame df:
id count
0 10 3
1 20 4
2 30 5
3 40 NaN
4 50 NaN
5 60 NaN
6 70 NaN
I also have another pandas Series s:
0 1000
1 2000
2 3000
3 4000
What I want to do is replace the NaN values in my df with the respective values from the Series s.
My final output should be:
id count
0 10 3
1 20 4
2 30 5
3 40 1000
4 50 2000
5 60 3000
6 70 4000
Any ideas how to achieve this?
Thanks in advance.
The problem is that the length of the Series can differ from the number of NaN values in the count column. So you need to reindex the Series to the number of NaN values:
s = pd.Series({0: 1000, 1: 2000, 2: 3000, 3: 4000, 5: 5000})
print (s)
0 1000
1 2000
2 3000
3 4000
5 5000
dtype: int64
df.loc[df['count'].isnull(), 'count'] = (
    s.reindex(np.arange(df['count'].isnull().sum())).values
)
print (df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
It's as simple as this:
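# use df['count'], since df.count is the DataFrame.count method; assumes len(s) equals the number of NaN values
df.loc[df['count'].isnull(), 'count'] = s.values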
In this case, I prefer iterrows for its readability.
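counter = 0
for index, row in df.iterrows():
    # pd.isnull works on scalars; a float has no .isnull() method
    if pd.isnull(row['count']):
        # df.at replaces set_value, which was removed in pandas 1.0
        df.at[index, 'count'] = s.iloc[counter]
        counter += 1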
I might add that this 'merging' of a dataframe and a series is a bit odd, and prone to bizarre errors. If you can somehow get the series into the same format as the dataframe (i.e., add some index/column labels), you might be better served by the merge function.
You can re-index your Series with the positions of the NaN values in the dataframe and then call fillna() with the Series:
s.index = np.where(df['count'].isnull())[0]
df['count'] = df['count'].fillna(s)
print(df)
id count
0 10 3.0
1 20 4.0
2 30 5.0
3 40 1000.0
4 50 2000.0
5 60 3000.0
6 70 4000.0
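Most of the answers above assume the Series has exactly as many values as there are NaN rows. A sketch that fills by position and tolerates a length mismatch, assuming the sample data from the question (nan_labels and n are illustrative names):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [10, 20, 30, 40, 50, 60, 70],
    "count": [3, 4, 5, np.nan, np.nan, np.nan, np.nan],
})
s = pd.Series([1000, 2000, 3000, 4000])

# Index labels of the rows that need filling, in order.
nan_labels = df.index[df["count"].isna()]

# Fill only as many rows as both sides can cover.
n = min(len(nan_labels), len(s))
df.loc[nan_labels[:n], "count"] = s.to_numpy()[:n]
print(df)
```

If s is shorter than the number of gaps, the remaining rows stay NaN; if it is longer, the extra values are ignored.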