I would like to calculate the difference (derivative) between contiguous values, for example:
list = 1, 3, 7, 6
list_diff = NaN, 2, 4, -1
The case above only works when there are no NaNs in the middle of the values. In the case below, I would like to know the grade difference to see how a student's learning evolves over time. The problem is that some grades are missing! We still want to calculate the difference across a gap, but only if there are at most 2 consecutive missing grades in it.
How can I do this?
df:
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
001 1 6 5 9 1 7 9
002 5 8 NaN 8' NaN NaN 2'
003 7 *8* NaN NaN NaN *2* 6
df_diff:
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
001 NaN 5 -1 4 -8 6 2
002 NaN 3 NaN 0 NaN NaN -6'
003 NaN 1 NaN NaN NaN *NaN* 4
See dataframe df: Note for students 001 and 002, the differences between grades are calculated even if NaNs are in the middle because they only have at most 2 missing grades. E.g. 2' - 8' = -6'
However, student 003 has a gap of 3 missing grades, so, the difference in this case will not be calculated. E.g. *2* - *8* = *NaN*.
Use ffill with the limit parameter to forward fill at most 2 consecutive missing values before DataFrame.diff, then use DataFrame.mask to put NaN back at the originally missing positions (which would otherwise show a difference of 0):
df = df.ffill(axis=1, limit=2).diff(axis=1).mask(df.isna())
print (df)
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
1 NaN 5.0 -1.0 4.0 -8.0 6.0 2.0
2 NaN 3.0 NaN 0.0 NaN NaN -6.0
3 NaN 1.0 NaN NaN NaN NaN 4.0
Details:
print (df.ffill(axis=1, limit=2))
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
1 1.0 6.0 5.0 9.0 1.0 7.0 9.0
2 5.0 8.0 8.0 8.0 8.0 8.0 2.0
3 7.0 8.0 8.0 8.0 NaN 2.0 6.0
print (df.ffill(axis=1, limit=2).diff(axis=1))
GRD1 GRD2 GRD3 GRD4 GRD5 GRD6 GRD7
1 NaN 5.0 -1.0 4.0 -8.0 6.0 2.0
2 NaN 3.0 0.0 0.0 0.0 0.0 -6.0
3 NaN 1.0 0.0 0.0 NaN NaN 4.0
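For anyone who wants to reproduce the steps above, here is a minimal self-contained sketch. The frame is rebuilt by hand from the question's table, and the names grades, filled and grade_diff are only illustrative, not from the original post.

```python
import numpy as np
import pandas as pd

# Grades from the question, rebuilt by hand (all columns kept as float).
grades = pd.DataFrame(
    [[1, 6, 5, 9, 1, 7, 9],
     [5, 8, np.nan, 8, np.nan, np.nan, 2],
     [7, 8, np.nan, np.nan, np.nan, 2, 6]],
    index=["001", "002", "003"],
    columns=[f"GRD{i}" for i in range(1, 8)],
    dtype=float,
)

# Step 1: bridge gaps of at most 2 consecutive NaNs along each row.
filled = grades.ffill(axis=1, limit=2)

# Step 2: difference between neighbouring columns, then
# Step 3: restore NaN wherever the original grade was missing.
grade_diff = filled.diff(axis=1).mask(grades.isna())
print(grade_diff)
```

Student 003 keeps a NaN at GRD6 because the third missing grade (GRD5) is not filled, so the diff at GRD6 has nothing to subtract from.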
What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaN's is random.
In the 'desired_output' column, trying to forward fill with ascending values only. Also, when lower values are encountered (row 8, value = 2.0 above), they are overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax to select the cumulative maximum value and ffill to replace the NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
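A small runnable check of the same idea, assuming only the 'test' values from the question. The variable names are illustrative. cummax skips NaN by default (skipna=True), so the NaN rows stay NaN until ffill carries the running maximum forward.

```python
import numpy as np
import pandas as pd

test = pd.Series(
    [np.nan, np.nan, 1, np.nan, np.nan, 3, np.nan, np.nan, 2, np.nan, 6, np.nan, np.nan]
)

# Running maximum over the non-NaN values; NaN positions are left as NaN.
running_max = test.cummax()

# Carry the last running maximum forward into the NaN positions.
result = running_max.ffill()
print(result)
```

The leading NaNs remain because there is nothing before them to fill from.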
I have the following data frame, to which I want to apply bfill as described below:
amount  percentage
NaN     NaN
1.0     20
2.0     10
NaN     NaN
NaN     NaN
NaN     NaN
NaN     NaN
3.0     50
4.0     10
NaN     NaN
5.0     10
I want to bfill the NaN values in the amount column according to the percentage column, i.e. if the corresponding percentage is 50, then only 50% of the NaN entries before that number are filled (a partial fill). E.g. the amount 3.0 has a percentage of 50, so of the 4 NaN entries before it, only 50% (2 of them) are to be bfilled.
proposed output:
amount  percentage
NaN     NaN
1.0     20
2.0     10
NaN     NaN
NaN     NaN
3.0     NaN
3.0     NaN
3.0     50
4.0     10
NaN     NaN
5.0     10
Please help.
Create groups according to NaNs
df['group_id'] = df.amount.where(df.amount.isna(), 1).cumsum().bfill()
Create a filling function
def custom_fill(x):
    # Calculate number of rows to be filled
    max_fill_rows = math.floor(x.iloc[-1, 1] * (x.shape[0] - 1) / 100)
    # Fill only if number of rows to fill is not zero
    return x.bfill(limit=max_fill_rows) if max_fill_rows else x
Fill the DataFrame
df.groupby('group_id').apply(custom_fill)
Output
amount percentage group_id
0 NaN NaN 1.0
1 1.0 20.0 1.0
2 2.0 10.0 2.0
3 NaN NaN 3.0
4 NaN NaN 3.0
5 3.0 50.0 3.0
6 3.0 50.0 3.0
7 3.0 50.0 3.0
8 4.0 10.0 4.0
9 NaN NaN 5.0
10 5.0 10.0 5.0
PS: Don't forget to import the required libraries
import math
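Here is a self-contained sketch of the same steps, assuming the data from the question and that every group ends in a row with a valid percentage. It loops over the groups and concatenates them instead of calling groupby(...).apply, which is my own substitution so the result does not depend on how apply treats the grouping column. The names group_id, blocks and filled are illustrative.

```python
import math

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": [np.nan, 1, 2, np.nan, np.nan, np.nan, np.nan, 3, 4, np.nan, 5],
    "percentage": [np.nan, 20, 10, np.nan, np.nan, np.nan, np.nan, 50, 10, np.nan, 10],
})

# Each group is a run of NaNs followed by the value that closes it.
group_id = df["amount"].where(df["amount"].isna(), 1).cumsum().bfill()


def custom_fill(block):
    # Percentage on the closing row decides how many NaN rows get filled.
    pct = block["percentage"].iloc[-1]
    max_fill_rows = math.floor(pct * (len(block) - 1) / 100)
    # bfill(limit=0) is not allowed, so skip groups with nothing to fill.
    return block.bfill(limit=max_fill_rows) if max_fill_rows else block


blocks = [custom_fill(block) for _, block in df.groupby(group_id)]
filled = pd.concat(blocks)
print(filled)
```

Note that bfill is applied to the whole block, so percentage is filled together with amount, as in the output shown above.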
Is there a Pythonic way, in a timeseries dataframe, to go down each column, pick the first number in a sequence and push it forward until the next NaN, then take the next non-NaN number and push that one down until the next NaN, and so on (retaining the indices and NaNs)?
For example, I would like to convert this dataframe:
DF = pd.DataFrame(data={'A':[np.nan,1,3,5,7,np.nan,2,4,6,np.nan], 'B':[8,6,4,np.nan,np.nan,9,7,3,np.nan,3], 'C':[np.nan,np.nan,4,2,6,np.nan,1,5,2,8]})
A B C
0 NaN 8.0 NaN
1 1.0 6.0 NaN
2 3.0 4.0 4.0
3 5.0 NaN 2.0
4 7.0 NaN 6.0
5 NaN 9.0 NaN
6 2.0 7.0 1.0
7 4.0 3.0 5.0
8 6.0 NaN 2.0
9 NaN 3.0 8.0
To this dataframe:
Result = pd.DataFrame(data={'A':[np.nan,1,1,1,1,np.nan,2,2,2,np.nan], 'B':[8,8,8,np.nan,np.nan,9,9,9,np.nan,3], 'C':[np.nan,np.nan,4,4,4,np.nan,1,1,1,1]})
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
I know I can use a loop to iterate down the columns to do this, but would appreciate some help on how to do it in a more efficient Pythonic way on a very large dataframe. Thank you.
IIUC:
# where DF is not NaN
mask = DF.notna()
Result = (DF.shift(-1)          # fill the original NaN's with their next value
            .mask(mask)         # replace all the original non-NaN with NaN
            .ffill()            # forward fill
            .fillna(DF.iloc[0]) # handle columns that start with a non-NaN
            .where(mask)        # put the original NaN's back
         )
Output:
A B C
0 NaN 8.0 NaN
1 1.0 8.0 NaN
2 1.0 8.0 4.0
3 1.0 NaN 4.0
4 1.0 NaN 4.0
5 NaN 9.0 NaN
6 2.0 9.0 1.0
7 2.0 9.0 1.0
8 2.0 NaN 1.0
9 NaN 3.0 1.0
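An alternative way to get the same result, not the method used in the answer above: label each run by counting the NaNs seen so far, then broadcast the first non-NaN value of each run with groupby(...).transform("first"). This is only a sketch, and hold_first_of_run is a made-up helper name.

```python
import numpy as np
import pandas as pd

DF = pd.DataFrame(data={
    "A": [np.nan, 1, 3, 5, 7, np.nan, 2, 4, 6, np.nan],
    "B": [8, 6, 4, np.nan, np.nan, 9, 7, 3, np.nan, 3],
    "C": [np.nan, np.nan, 4, 2, 6, np.nan, 1, 5, 2, 8],
})


def hold_first_of_run(col):
    # Every NaN starts a new run id; the values after it share that id.
    run_id = col.isna().cumsum()
    # "first" returns the first non-NaN value of each run.
    first_in_run = col.groupby(run_id).transform("first")
    # Put NaN back where the original column had NaN.
    return first_in_run.where(col.notna())


result = DF.apply(hold_first_of_run)
print(result)
```

This works column by column, so it may be slower than the shift/mask/ffill chain on a very large frame.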
I have the following data:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
dataFrame = pandas.DataFrame(data).transpose()
Output:
0 1 2
0 1.0 1.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
3 NaN 4.0 4.0
4 NaN 5.0 5.0
5 NaN NaN 6.0
6 NaN NaN 7.0
Is it possible to create a 4th column AT THE SAME TIME as the other columns are created from data, with the same length as the longest column of this dataframe (the 3rd one)?
The data in this column doesn't matter. Assume it's 8. So the desired output would be:
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
In my script the dataframe changes on every run, which means the longest column changes with it.
Thanks for reading
This is quite similar to the answers from @jpp, @Cleb, and maybe some other answers here, just slightly simpler:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]] + [[]]
After pd.DataFrame(data).transpose(), this automatically gives you a column of NaNs that is the same length as the longest column, so you don't need the extra work of calculating the length of the longest column. Resulting dataframe:
0 1 2 3
0 1.0 1.0 1.0 NaN
1 2.0 2.0 2.0 NaN
2 3.0 3.0 3.0 NaN
3 NaN 4.0 4.0 NaN
4 NaN 5.0 5.0 NaN
5 NaN NaN 6.0 NaN
6 NaN NaN 7.0 NaN
Note that this answer is less general than some others here (such as those by @jpp & @Cleb) in that it will only fill with NaNs. If you want a default fill value other than NaN, you should use one of their answers.
You can append to a list which then immediately feeds the pd.DataFrame constructor:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
df = pd.DataFrame(data + [[8]*max(map(len, data))]).transpose()
print(df)
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
But this is inefficient. Pandas uses NumPy to hold underlying series and setting a series to a constant value is trivial and efficient; you can simply use:
df[3] = 8
It is not entirely clear what you mean by at the same time, but the following would work:
import pandas as pd
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
# get the longest list in data
data.append([8] * max(map(len, data)))
pd.DataFrame(data).transpose()
yielding
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If you'd like to do it as you create the DataFrame, simply chain a call to assign:
pd.DataFrame(data).T.assign(**{'3': 8})
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
You can define a function (read the comments):
def f(df):
    l=[8]*df[max(df,key=lambda x:df[x].count())].count()
    df[3]=l+[np.nan]*(len(df)-len(l))
    # the above two lines can be just `df[3] = another solution currently for this problem`
    return df
dataFrame = f(pandas.DataFrame(data).transpose())
Then now:
print(dataFrame)
Returns:
0 1 2 3
0 1.0 1.0 1.0 8
1 2.0 2.0 2.0 8
2 3.0 3.0 3.0 8
3 NaN 4.0 4.0 8
4 NaN 5.0 5.0 8
5 NaN NaN 6.0 8
6 NaN NaN 7.0 8
If by "at the same time" you mean at the same time as running pd.DataFrame, the data has to be prepped before it is loaded into your frame.
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
longest = max(len(i) for i in data)
dummy = [8 for i in range(longest)] #dummy data filled with 8
data.append(dummy)
dataFrame = pd.DataFrame(data).transpose()
The example above gets the longest element in your list and creates a dummy to be added onto it before creating your dataframe.
One solution is to add an element to the list that is passed to the dataframe:
pd.DataFrame(data + [[np.hstack(data).max() + 1] * len(max(data, key=len))]).T
0 1 2 3
0 1.0 1.0 1.0 8.0
1 2.0 2.0 2.0 8.0
2 3.0 3.0 3.0 8.0
3 NaN 4.0 4.0 8.0
4 NaN 5.0 5.0 8.0
5 NaN NaN 6.0 8.0
6 NaN NaN 7.0 8.0
If data is to be modified just:
data = [[1,2,3], [1,2,3,4,5], [1,2,3,4,5,6,7]]
data = data + [[np.hstack(data).max() + 1] * len(max(data, key=len))]
pd.DataFrame(data).T
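One caveat when measuring the longest list: max on a list of lists compares them lexicographically, not by length, so it only happens to return the longest list for the sample data. A small sketch with made-up data where the two differ (the names lexicographic_max, longest and filler are illustrative):

```python
import pandas as pd

# Hypothetical data where the "largest" list is not the longest one.
data = [[9], [1, 2, 3], [1, 2, 3, 4, 5]]

lexicographic_max = max(data)   # compares element by element
longest = max(data, key=len)    # compares by length

# Build the constant column with the length of the longest list.
filler = [8] * len(longest)
df = pd.DataFrame(data + [filler]).T
print(lexicographic_max, longest)
print(df)
```

Using key=len (or max(map(len, data)) as in the other answers) keeps the filler column as long as the longest input regardless of the values.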
I have a dataframe of race results. I'd like to create a series that takes the last stage position and subtracts from it the average of all the stages before it. Here is a small slice of the df (the real one could have more stages, countries and rows):
race_location stage1_position stage2_position stage3_position number_of_stages
AUS 2.0 2.0 NaN 2
AUS 1.0 5.0 NaN 2
AUS 3.0 4.0 NaN 2
AUS 4.0 8.0 NaN 2
AUS 10.0 6.0 NaN 2
AUS 9.0 7.0 NaN 2
FRA 23.0 1.0 10.0 3
FRA 6.0 12.0 24.0 3
FRA 14.0 11.0 14.0 3
FRA 18.0 10.0 1.0 3
FRA 15.0 14.0 4.0 3
USA 24.0 NaN NaN 1
USA 7.0 NaN NaN 1
USA 22.0 NaN NaN 1
USA 11.0 NaN NaN 1
USA 8.0 NaN NaN 1
USA 16.0 NaN NaN 1
USA 13.0 NaN NaN 1
USA 19.0 NaN NaN 1
USA 5.0 NaN NaN 1
USA 25.0 NaN NaN 1
The output would be
last_stage_minus_average
0
4
1
4
-4
-2
-2
15
1.5
-13
-10.5
0
0
0
0
0
0
0
0
0
0
This won't work, but I was thinking of something like this:
new_series = []
for country in country_list:
    num_stages = df.loc[df['race_location'] == country, 'number_of_stages']
    difference = df.ix[df['race_location'] == country, num_stages] - \
        df.iloc[:, 0:num_stages-1].mean(axis=1)
    new_series.append(difference)
I'm not sure how to go about doing this. Any help or direction would be amazing!
#use pandas apply to take the mean for the first n-1 stages and subtract from last stage.
df.apply(lambda x: x.iloc[x.number_of_stages]-np.mean(x.iloc[1:x.number_of_stages]),axis=1).fillna(0)
Out[264]:
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
I'd use filter to get just the stage columns, then stack and groupby
stages = df.filter(regex=r'^stage\d+.*')
stages.stack().groupby(level=0).apply(
    lambda x: x.iloc[-1] - x.iloc[:-1].mean()
).fillna(0)
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 0.0
12 0.0
13 0.0
14 0.0
15 0.0
16 0.0
17 0.0
18 0.0
19 0.0
20 0.0
dtype: float64
how it works
stack drops the NaN values when converting to a Series (this is its default, dropna=True, in the pandas versions this answer was written for).
After grouping by the first level of the new MultiIndex, position -1 is the last value within each group.
So we use a lambda to calculate the mean of everything before the last value: x.iloc[:-1].mean()
And subtract that from the last value: x.iloc[-1]
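The explanation above can be tried on a three-row toy frame, one row per country from the question. This is a sketch only: the explicit .dropna() after stack is my addition so the example does not rely on stack's default NaN handling, and the name long_form is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "race_location": ["AUS", "FRA", "USA"],
    "stage1_position": [2.0, 23.0, 24.0],
    "stage2_position": [2.0, 1.0, np.nan],
    "stage3_position": [np.nan, 10.0, np.nan],
    "number_of_stages": [2, 3, 1],
})

stages = df.filter(regex=r"^stage\d+")

# One entry per (row, stage) pair; stages that were not raced are removed.
long_form = stages.stack().dropna()

# Per original row: last stage minus the mean of the earlier stages.
# A single-stage row has no earlier stages, so the mean is NaN -> filled with 0.
result = (
    long_form.groupby(level=0)
    .apply(lambda x: x.iloc[-1] - x.iloc[:-1].mean())
    .fillna(0)
)
print(result)
```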
"subtracts that by the average of all the stages before that"
It's not a big deal, but I'm curious: going by your description rather than your desired output, if a racer has finished only one race, shouldn't their result be inf or NaN instead of 0? That would distinguish them from a racer who has completed 2 or 3 races and whose last result is exactly the same as the average of the earlier ones (compare racer #1 with racers #11~20).
df_sp = df.filter(regex=r'^stage\d+.*')
df['last'] = df_sp.ffill(axis=1).iloc[:, -1]
df['mean'] = (df_sp.sum(axis=1) - df['last']) / (df['number_of_stages'] - 1)
print(df['last'] - df['mean'])
0 0.0
1 4.0
2 1.0
3 4.0
4 -4.0
5 -2.0
6 -2.0
7 15.0
8 1.5
9 -13.0
10 -10.5
11 NaN
12 NaN
13 NaN
14 NaN
15 NaN
16 NaN
17 NaN
18 NaN
19 NaN
20 NaN
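To see where the NaN for single-stage racers comes from, here is a self-contained sketch on a toy frame with one row per country from the question. The names last, prior_mean and diff are illustrative, and ffill(axis=1) is used to pick the last raced stage.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "stage1_position": [2.0, 23.0, 24.0],
    "stage2_position": [2.0, 1.0, np.nan],
    "stage3_position": [np.nan, 10.0, np.nan],
    "number_of_stages": [2, 3, 1],
})
stages = df.filter(regex=r"^stage\d+")

# Last raced stage: forward fill along the row and take the final column.
last = stages.ffill(axis=1).iloc[:, -1]

# Mean of the earlier stages; with one stage this is 0 / 0, which gives NaN.
prior_mean = (stages.sum(axis=1) - last) / (df["number_of_stages"] - 1)

diff = last - prior_mean
print(diff)
```

Appending .fillna(0) would reproduce the zeros in the question's desired output if that is preferred.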