Variable shift in a pandas dataframe - python

import pandas as pd

df = pd.DataFrame({'A': [3, 5, 3, 4, 2, 3, 2, 3, 4, 3, 2, 2, 2, 3],
                   'B': [10, 20, 30, 40, 20, 30, 40, 10, 20, 30, 15, 60, 20, 15]})
A B
0 3 10
1 5 20
2 3 30
3 4 40
4 2 20
5 3 30
6 2 40
7 3 10
8 4 20
9 3 30
10 2 15
11 2 60
12 2 20
13 3 15
I'd like to append a C column, containing rolling average of B (rolling period = A).
For example, the C value at row index 2 should be the 3-period rolling mean of B, mean(10, 20, 30), and the C value at row index 4 should be the 2-period rolling mean, mean(40, 20).

probably stupid slow... but this gets it done
def crazy_apply(row):
    # ordinal position of this row's label in the index
    p = df.index.get_loc(row.name)
    a = row.A
    # slice the a values of B ending at this position and average them
    return df.B.iloc[p - a + 1:p + 1].mean()

df.apply(crazy_apply, axis=1)
0 NaN
1 NaN
2 20.000000
3 25.000000
4 30.000000
5 30.000000
6 35.000000
7 26.666667
8 25.000000
9 20.000000
10 22.500000
11 37.500000
12 40.000000
13 31.666667
dtype: float64
Explanation:
apply iterates over either the columns or the rows; here it iterates over the rows because we passed axis=1. Each iteration passes a pandas Series representing the current row: the current index value is in the row's name attribute, and the index of the row object is the same as the columns of df.
So df.index.get_loc(row.name) finds the ordinal position of the current index value held in row.name, and row.A is the value of column A for that row. df.B.iloc[p - a + 1:p + 1] then slices the a values of B ending at that position, and .mean() averages them; an empty slice (a window that would start before row 0) averages to NaN, which produces the NaN in the first two rows.
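If apply is too slow, here is a minimal sketch of the same logic on plain numpy arrays (assuming the default RangeIndex, so positions and labels coincide):
import numpy as np
# Variable-window rolling mean: at each position p, average the window of
# A[p] values of B ending at p; windows reaching before row 0 become NaN.
a = df['A'].to_numpy()
b = df['B'].to_numpy()
df['C'] = [b[p - n + 1:p + 1].mean() if p - n + 1 >= 0 else np.nan
           for p, n in enumerate(a)]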

Related

adding column to df that calculates count of different column using groupby

I'm trying to create a new column in a df. I want the new column to equal the count of rows for each unique mother_ID, which is a different column in the df.
This is what I'm currently doing. It makes the new column, but the new column is filled with NaNs:
df.columns = ['mother_ID', 'date_born', 'mother_mass_g', 'hatchling_masses_g']
df.to_numpy()
count = df.groupby('mother_ID').hatchling_masses_g.count()
df['count'] = count
When I print the new df, the count column is all NaN, although if I simply print(count) I get the correct counts for each mother_ID. Does anyone know what I'm doing wrong?
Use groupby transform('count'):
df['count'] = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
Notice the difference between groupby count and groupby transform with 'count'.
Sample Data:
import numpy as np
import pandas as pd

np.random.seed(5)
df = pd.DataFrame({
    'mother_ID': np.random.choice(['a', 'b'], 10),
    'hatchling_masses_g': np.random.randint(1, 100, 10)
})
mother_ID hatchling_masses_g
0 b 63
1 a 28
2 b 31
3 b 81
4 a 8
5 a 77
6 a 16
7 b 54
8 a 81
9 a 28
groupby.count
counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
mother_ID
a 6
b 4
Name: hatchling_masses_g, dtype: int64
Notice how there are only 2 rows. When assigning back to the DataFrame, which has 10 rows, pandas doesn't know how to align the data back, which results in NaNs indicating missing data:
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 NaN
1 a 28 NaN
2 b 31 NaN
3 b 81 NaN
4 a 8 NaN
5 a 77 NaN
6 a 16 NaN
7 b 54 NaN
8 a 81 NaN
9 a 28 NaN
It's trying to find 'a' and 'b' in the index, and since it cannot, it fills the column with only NaN values.
groupby.transform('count')
transform, on the other hand, will populate the entire group with the count:
counts = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
counts:
0 4
1 6
2 4
3 4
4 6
5 6
6 6
7 4
8 6
9 6
Name: hatchling_masses_g, dtype: int64
Notice 10 rows were created (one for every row in the DataFrame).
This assigns back to the dataframe nicely (since the indexes align):
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
If needed, the counts can be computed via groupby count and then joined back to the DataFrame on the group key:
counts = df.groupby('mother_ID')['hatchling_masses_g'].count().rename('count')
df = df.join(counts, on='mother_ID')
counts:
mother_ID
a 6
b 4
Name: count, dtype: int64
df:
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
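A map-based sketch does the same as the join, assuming mother_ID has no missing keys: Series.map looks each row's key up in the aggregated counts.
counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
df['count'] = df['mother_ID'].map(counts)  # broadcasts the per-group count to every row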

pandas dataframe sort columns according to column totals

I was able to sort rows according to the last column. However, I also have a row at the bottom of the dataframe which has the totals of each column. I couldn't find a way to sort the columns according to the totals in the last row. The table looks like the following:
A B C T
0 9 9 9 27
1 9 10 4 23
2 7 4 8 19
3 2 6 9 17
T 27 29 30
I want this table to be sorted so that the order of columns will be from left to right C, B, A from highest total to lowest. How can this be done?
Use DataFrame.sort_values by index value T with axis=1:
df = df.sort_values('T', axis=1, ascending=False)
print(df)
C B A T
0 9 9 9 27.0
1 4 10 9 23.0
2 8 4 7 19.0
3 9 6 2 17.0
T 30 29 27 NaN
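The T column lands last because its entry in the totals row is missing, and NaN sorts to the end with ascending=False; the totals row becomes float for the same reason. If you would rather pin the grand-total column in place, a sketch that sorts only the data columns (assuming the totals row and column are both labeled 'T'):
data_cols = df.columns.drop('T')                                   # A, B, C
order = df.loc['T', data_cols].sort_values(ascending=False).index  # C, B, A
df = df[list(order) + ['T']]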

Compute a ratio conditional on the value in the column of a pandas dataframe

I have a dataframe of the following type
df = pd.DataFrame({'Days': [1, 2, 5, 6, 7, 10, 11, 12],
                   'Value': [100.3, 150.5, 237.0, 314.15, 188.0, 413.0, 158.2, 268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose value is the ratio between the Value corresponding to Days+5 and the Value at Days.
For example, in the first row I would have 3.13210368893 = 314.15/100.3, in the second 1.24916943522 = 188.0/150.5, and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm struggling to find a way to do it using a lambda function.
Could someone help me find a way to solve this problem?
Thanks in advance.
Edit
In the case I am interested in, the "Days" field can vary sparsely from 1 to 18180, for instance.
You can use merge; a benefit of doing this is that it handles missing values:
s = df.merge(df.assign(Days=df.Days - 5), on='Days')  # pairs each day with the day 5 ahead
s.assign(Value=s.Value_y / s.Value_x).drop(['Value_x', 'Value_y'], axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
Consider left-merging onto a helper dataframe, days_df, of consecutive daily points, then shifting by 5 rows for the ratio calculation. Finally, remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN
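A lookup-based sketch of the same computation, assuming the Days values are unique: index Value by Days, then reindex at Days + 5, which returns NaN wherever no matching day exists.
s = df.set_index('Days')['Value']
# .to_numpy() avoids index alignment between the shifted lookup and Value
df['+5Ratio'] = s.reindex(df['Days'] + 5).to_numpy() / df['Value'].to_numpy()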

Refer to next index in pandas

If I had a simple pandas DataFrame like this:
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((3, 4)), columns=list('abcd'), index=list('123'))
I want to find the max value in each row, use it to locate the value in the same column but in the next row, and add that value to a new column.
So the above DataFrame looks like this (with d2 changed to 3):
a b c d
1 1 2 3 4
2 5 6 7 3
3 9 10 11 12
So, conceptually the first row should be scanned, 4 is identified as the largest number, then 3 is found as the number within the same column but in the next index. Similarly for the row 2, 7 is the largest number, and 11 is the next number in that column. So 3 and 11 should get added to a new column like this:
a b c d Next
1 1 2 3 4 NaN
2 5 6 7 3 3
3 9 10 11 12 11
I started by making a function like this, but it only finds the max values.
f = lambda x: x.max()
max = frame.apply(f, axis='columns')
frame['Next'] = max
Based on your edit, you can use np.argmax:
i = np.arange(len(frame))
j = np.argmax(frame.values, axis=1)           # column position of each row's max
frame['next'] = frame.shift(-1).values[i, j]  # that column's value in the next row
a b c d next
1 1 2 3 4 3.0
2 5 6 7 3 11.0
3 9 10 11 12 NaN
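A label-based sketch of the same idea: idxmax names each row's max column, and shift(-1) exposes the next row's values under the current row's index.
nxt = frame.shift(-1)  # row i now holds row i+1's values; the last row is NaN
frame['Next'] = [nxt.at[ix, col] for ix, col in frame.idxmax(axis=1).items()]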

Select particular rows from inside groups in pandas dataframe

Suppose I have a dataframe that looks like this:
group level
0 1 10
1 1 10
2 1 11
3 2 5
4 2 5
5 3 9
6 3 9
7 3 9
8 3 8
The desired output is this:
group level
0 1 10
5 3 9
Namely, this is the logic: look inside each group; if there is more than one distinct value in the level column, return the first row of that group. For example, no row from group 2 is selected, because the only value present in the level column is 5.
In addition, how does the situation change if I want the last, instead of the first row of such groups?
What I have attempted was combining groupby statements with creating sets from the entries in the level column, but I failed to produce anything even nearly sensible.
This can be done with groupby and using apply to run a simple function on each group:
def get_first_val(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group['level'].loc[group['level'].first_valid_index()]
    else:
        return None

df.groupby('group').apply(get_first_val).dropna()
Out[8]:
group
1 10
3 9
dtype: float64
There's also a last_valid_index() method, so you wouldn't have to
make any huge changes to get the last row instead.
If you have other columns that you want to keep, you just need a slight tweak:
import numpy as np
df['col1'] = np.random.randint(10, 20, 9)
df['col2'] = np.random.randint(20, 30, 9)
df
Out[17]:
group level col1 col2
0 1 10 19 21
1 1 10 18 24
2 1 11 14 23
3 2 5 14 26
4 2 5 10 22
5 3 9 13 27
6 3 9 16 20
7 3 9 18 26
8 3 8 11 2
def get_first_val_keep_cols(group):
    has_multiple_vals = len(group['level'].unique()) >= 2
    if has_multiple_vals:
        return group.loc[group['level'].first_valid_index(), :]
    else:
        return None

df.groupby('group').apply(get_first_val_keep_cols).dropna()
Out[20]:
group level col1 col2
group
1 1 10 19 21
3 3 9 13 27
This would be simpler:
print(df.groupby('group')
        .agg(lambda x: x.values[0] if (x.values != x.values[0]).any() else np.nan)
        .dropna())
level
group
1 10
3 9
For each group, if any of the values are not the same as the first value, aggregate that group into its first value; otherwise, aggregate it to nan.
Finally, dropna().
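A filter-based sketch of the same selection that also covers the "last row" variant: transform('nunique') flags groups with more than one distinct level, then head(1)/tail(1) keep the first or last row of each surviving group.
multi = df.groupby('group')['level'].transform('nunique') > 1
first_rows = df[multi].groupby('group').head(1)  # rows 0 and 5, as desired above
last_rows = df[multi].groupby('group').tail(1)   # rows 2 and 8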
