Consider I've a dataframe of 10 rows having two columns A and B as following :
A B
0 21 6
1 87 0
2 87 0
3 25 0
4 25 0
5 14 0
6 79 0
7 70 0
8 54 0
9 35 0
In excel I can calculate the rolling mean like this excluding the first row:
How can I do this in pandas?
Here is what I've tried:
import pandas as pd
df = pd.read_clipboard() #copying the dataframe given above and calling read_clipboard will get the df populated
for i in range(1, len(df)):
df.loc[i, 'B'] = df[['A', 'B']].loc[i-1].mean()
This gives me the desired result matching excel. But is there a better pandas way to do it? I've tried using expanding and rolling did not produce desired result.
You have an exponentially weighted moving average, rather than a simple moving average. That's why pd.DataFrame.rolling didn't work. You might be looking for pd.DataFrame.ewm instead.
Starting from
df
Out[399]:
A B
0 21 6
1 87 0
2 87 0
3 25 0
4 25 0
5 14 0
6 79 0
7 70 0
8 54 0
9 35 0
df['B'] = df["A"].shift().fillna(df["B"]).ewm(com=1, adjust=False).mean()
df
Out[401]:
A B
0 21 6.000000
1 87 13.500000
2 87 50.250000
3 25 68.625000
4 25 46.812500
5 14 35.906250
6 79 24.953125
7 70 51.976562
8 54 60.988281
9 35 57.494141
Even on just ten rows, doing it this way speeds up the code by about a factor of 10 with %timeit (959 microseconds from 10.3ms). On 100 rows, this becomes a factor of 100 (1.1ms vs 110ms).
Related
I'm trying to create a new column in a df. I want the new column to equal the count of the number rows of each unique 'mother_ID, which is a different column in the df.
This is what I'm currently doing. It makes the new column but the new column is filled with 'NaN's.
df.columns = ['mother_ID', 'date_born', 'mother_mass_g', 'hatchling_masses_g']
df.to_numpy()
This is how the original df appears when I print it:
count = df.groupby('mother_ID').hatchling_masses_g.count()
df['count']= count
Pic below shows what I get when I print new df, although if I simply print(count) I get the correct counts for each mother_ID . Does anyone know what I'm doing wrong?
Use groupby transform('count'):
df['count'] = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
Notice the difference between groupby count and groupby tranform with 'count'.
Sample Data:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({
'mother_ID': np.random.choice(['a', 'b'], 10),
'hatchling_masses_g': np.random.randint(1, 100, 10)
})
mother_ID hatchling_masses_g
0 b 63
1 a 28
2 b 31
3 b 81
4 a 8
5 a 77
6 a 16
7 b 54
8 a 81
9 a 28
groupby.count
counts = df.groupby('mother_ID')['hatchling_masses_g'].count()
mother_ID
a 6
b 4
Name: hatchling_masses_g, dtype: int64
Notice how there are only 2 rows. When assigning back to the DataFrame there are 10 rows which means that pandas doesn't know how to align the data back. Which results in NaNs indicating missing data:
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 NaN
1 a 28 NaN
2 b 31 NaN
3 b 81 NaN
4 a 8 NaN
5 a 77 NaN
6 a 16 NaN
7 b 54 NaN
8 a 81 NaN
9 a 28 NaN
It's trying to find 'a' and 'b' in the index and since it cannot it fills with only NaN values.
groupby.tranform('count')
transform, on the other hand, will populate the entire group with the count:
counts = df.groupby('mother_ID')['hatchling_masses_g'].transform('count')
counts:
0 4
1 6
2 4
3 4
4 6
5 6
6 6
7 4
8 6
9 6
Name: hatchling_masses_g, dtype: int64
Notice 10 rows were created (one for every row in the DataFrame):
This assigns back to the dataframe nicely (since the indexes align):
df['count'] = counts
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
If needed counts can be done via groupby count, then join back to the DataFrame on the group key:
counts = df.groupby('mother_ID')['hatchling_masses_g'].count().rename('count')
df = df.join(counts, on='mother_ID')
counts:
mother_ID
a 6
b 4
Name: count, dtype: int64
df:
mother_ID hatchling_masses_g count
0 b 63 4
1 a 28 6
2 b 31 4
3 b 81 4
4 a 8 6
5 a 77 6
6 a 16 6
7 b 54 4
8 a 81 6
9 a 28 6
I am having trouble applying some logic across my entire dataset. I am able to apply the logic on a small "group" but not on all of the groups (note, the groups are made by primaryFilter and secondaryFilter. Do you all mind pointing me in the right direction to go about this?
Entire Data
import pandas as pd
import numpy as np
myInput = {
'primaryFilter': [100,100,100,100,100,100,100,100,100,100,200,200,200,200,200,200,200,200,200,200],
'secondaryFilter': [1,1,1,1,2,2,2,3,3,3,1,1,2,2,2,2,3,3,3,3],
'constantValuePerGroup': [15,15,15,15,20,20,20,17,17,17,10,10,30,30,30,30,22,22,22,22],
'someValue':[3,1,4,7,9,9,2,7,3,7,6,4,7,10,10,3,4,6,7,5]
}
df_input = pd.DataFrame(data=myInput)
df_input
Test Data (First Group)
df_test = df_input[df_input.primaryFilter.isin([100])]
df_test = df_test[df_test.secondaryFilter == 1.0]
df_test['newColumn'] = np.nan
for index,row in df_test.iterrows():
if index==0:
print("start")
df_test.loc[0, 'newColumn'] = 0
elif index==df_test.shape[0]-1:
df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
print("end")
else:
print("inter")
df_test.loc[index, 'newColumn'] = df_test.loc[index-1, 'newColumn'] + df_test.loc[index-1, 'someValue']
df_test["delta"] = df_test["constantValuePerGroup"] - df_test['newColumn']
df_test.head()
Here is the output of the test
I now would like to apply the above logic to the remaining groups 100,2 and 100,3 and 200,1 and so forth..
No need to use iterrows here, you can group the dataframe on primaryFilter and secondaryFilter columns then for each unique group take the cumulative sum of values in column someValue and shift the resulting cummulative sum by 1 position downwards to obtain newColumn. Finally subtract newColumn from constantValuePerGroup to get the delta.
df_input['newColumn'] = df_input.groupby(['primaryFilter', 'secondaryFilter'])['someValue'].apply(lambda s: s.cumsum().shift(fill_value=0))
df_input['delta'] = df_input['constantValuePerGroup'] - df_input['newColumn']
>>> df_input
primaryFilter secondaryFilter constantValuePerGroup someValue newColumn delta
0 100 1 15 3 0 15
1 100 1 15 1 3 12
2 100 1 15 4 4 11
3 100 1 15 7 8 7
4 100 2 20 9 0 20
5 100 2 20 9 9 11
6 100 2 20 2 18 2
7 100 3 17 7 0 17
8 100 3 17 3 7 10
9 100 3 17 7 10 7
10 200 1 10 6 0 10
11 200 1 10 4 6 4
12 200 2 30 7 0 30
13 200 2 30 10 7 23
14 200 2 30 10 17 13
15 200 2 30 3 27 3
16 200 3 22 4 0 22
17 200 3 22 6 4 18
18 200 3 22 7 10 12
19 200 3 22 5 17 5
I have a pretty similiar question to another question on here.
Let's assume I have two dataframes:
df
volumne
11
24
30
df2
range_low range_high price
10 20 1
21 30 2
How can I filter the second dataframe, based for one row of the first dataframe, if the value range is true?
So for example (value 11 from df) leads to:
df3
range_low range_high price
10 20 1
wherelese (value 30 from df) leads to:
df3
I am looking for a way to check, if a specific value is in a range of another dataframe, and filter the dataframe based on this condition. In none python code:
Find 11 in
(10, 20), if True: df3 = filter on this row
(21, 30), if True: df3= filter on this row
if not
return empty frame
For loop solution use:
for v in df['volumne']:
df3 = df2[(df2['range_low'] < v) & (df2['range_high'] > v)]
print (df3)
For non loop solution is possible use cross join, but if large DataFrames there should be memory problem:
df = df.assign(a=1).merge(df2.assign(a=1), on='a', how='outer')
print (df)
volumne a range_low range_high price
0 11 1 10 20 1
1 11 1 21 30 2
2 24 1 10 20 1
3 24 1 21 30 2
4 30 1 10 20 1
5 30 1 21 30 2
df3 = df[(df['range_low'] < df['volumne']) & (df['range_high'] > df['volumne'])]
print (df3)
volumne a range_low range_high price
0 11 1 10 20 1
3 24 1 21 30 2
I have a similar problem (but with date ranges), and if df2 is too large, it will take for ever.
If the volumes are always integers, a faster solution is to create an intermediate dataframe where you associate each possible volume to a price (in one iteration) and then merge.
price_list=[]
for index, row in df2.iterrows():
x=pd.DataFrame(range(row['range_low'],row['range_high']+1),columns=['volume'])
x['price']=row['price']
price_list.append(x)
df_prices=pd.concat(price_list)
you will get something like this
volume price
0 10 1
1 11 1
2 12 1
3 13 1
4 14 1
5 15 1
6 16 1
7 17 1
8 18 1
9 19 1
10 20 1
0 21 2
1 22 2
2 23 2
3 24 2
4 25 2
5 26 2
6 27 2
7 28 2
8 29 2
9 30 2
then you can quickly associate associate a price to each volume in df
df.merge(df_prices,on='volume')
volume price
0 11 1
1 24 2
2 30 2
I have a pandas dataframe:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0,40,size=(10,4)), columns=range(4), index = range(10))
df.head()
0 1 2 3
0 27 10 13 21
1 25 12 23 8
2 2 24 24 34
3 10 11 11 10
4 0 15 0 27
I'm using the idxmax function to get the columns that contain the maximum value.
df_max = df.idxmax(1)
df_max.head()
0 0
1 0
2 3
3 1
4 3
How can I use df_max along with df, to create a time-series of values corresponding to the maximum value in each row of df? This is the output I want:
0 27
1 25
2 34
3 11
4 27
5 37
6 35
7 32
8 20
9 38
I know I can achieve this using df.max(1), but I want to know how to arrive at this same output by using df_max, since I want to be able to apply df_max to other matrices (not df) which share the same columns and indices as df (but not the same values).
You may try df.lookup
df.lookup(df_max.index, df_max)
Out[628]: array([27, 25, 34, 11, 27], dtype=int64)
If you want Series/DataFrame, you pass the output to the Series/DataFrame constructor
pd.Series(df.lookup(df_max.index, df_max), index=df_max.index)
Out[630]:
0 27
1 25
2 34
3 11
4 27
dtype: int64
I have one massive pandas dataframe with this structure:
df1:
A B
0 0 12
1 0 15
2 0 17
3 0 18
4 1 45
5 1 78
6 1 96
7 1 32
8 2 45
9 2 78
10 2 44
11 2 10
And a second one, smaller like this:
df2
G H
0 0 15
1 1 45
2 2 31
I want to add a column to my first dataframe following this rule: column df1.C = df2.H when df1.A == df2.G
I manage to do it with for loops, but the database is massive and the code run really slowly so I am looking for a Pandas-way or numpy to do it.
Many thanks,
Boris
If you only want to match mutual rows in both dataframes:
import pandas as pd
df1 = pd.DataFrame({'Name':['Sara'],'Special ability':['Walk on water']})
df1
Name Special ability
0 Sara Walk on water
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1, left_on='Name', right_on='Name', how='left')
df
Name Age Special ability
0 Sara 4 NaN
1 Gustaf 12 Walk on water
2 Patrik 11 NaN
This Can allso be done with more than one matching argument: (In this example Patrik from df1 does not exist in df2 becuse they have different ages and therfore will not merge)
df1 = pd.DataFrame({'Name':['Sara','Patrik'],'Special ability':['Walk on water','FireBalls'],'Age':[12,83]})
df1
Name Special ability Age
0 Sara Walk on water 12
1 Patrik FireBalls 83
df2 = pd.DataFrame({'Name':['Sara', 'Gustaf', 'Patrik'],'Age':[4,12,11]})
df2
Name Age
0 Sara 4
1 Gustaf 12
2 Patrik 11
df = df2.merge(df1,left_on=['Name','Age'],right_on=['Name','Age'],how='left')
df
Name Age Special ability
0 Sara 12 Walk on water
1 Gustaf 12 NaN
2 Patrik 11 NaN
You probably want to use a merge:
df=df1.merge(df2,left_on="A",right_on="G")
will give you a dataframe with 3 columns, but the third one's name will be H
df.columns=["A","B","C"]
will then give you the column names you want
You can use map by Series created by set_index:
df1['C'] = df1['A'].map(df2.set_index('G')['H'])
print (df1)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Or merge with drop and rename:
df = df1.merge(df2,left_on="A",right_on="G", how='left')
.drop('G', axis=1)
.rename(columns={'H':'C'})
print (df)
A B C
0 0 12 15
1 0 15 15
2 0 17 15
3 0 18 15
4 1 45 45
5 1 78 45
6 1 96 45
7 1 32 45
8 2 45 31
9 2 78 31
10 2 44 31
11 2 10 31
Here's one vectorized NumPy approach -
idx = np.searchsorted(df2.G.values, df1.A.values)
df1['C'] = df2.H.values[idx]
idx could be computed in a simpler way with : df2.G.searchsorted(df1.A), but don't think that would be anymore efficient, because we want to use the underlying array with .values for performance as done earlier.