Averaging specific columns and storing the result in a new column - python

What am I doing wrong here? I have a dataframe to which I am adding two new columns. The first creates a count by adding up all the values in each column to the right that are equal to 1; that part works fine. The next part of the code should give the average of all the values to the right that are not equal to 0, but for some reason it is also taking the values to the left into account. Here is the code. Thanks for any help.
I have tried my code as well as both solutions below and am still getting the wrong average. Here's a simplified version with a random dataframe and all three versions of the code. I have removed the values to the left and still have the issue of the average being wrong. Maybe this will help.
Version 1:
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
idx_last = len(df.columns)
df.insert(loc=0, column='new', value=df[df[0:(idx_last+1)]==1].sum(axis=1))
idx_last = len(df.columns)
df.insert(loc=1, column='avg', value=df[df[0:(idx_last+1)]!=0].mean(axis=1))
df
Version 2:
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
df.insert(loc=0, column='new', value=(df.iloc[:, 0:]==1).sum(axis=1))
df.insert(loc=1, column='avg', value=(df.iloc[:, 1:]!=0).mean(axis=1))
df
Version 3:
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
idx_last = len(df.columns)
loc_value=0
df.insert(loc=loc_value, column='new', value=df[df[loc_value:(idx_last+1)]==1].sum(axis=1))
idx_last = len(df.columns)
loc_value=1
df.insert(loc=loc_value, column='avg', value=df[df[loc_value: (idx_last+1)]!=0].sum(axis=1))
df

I believe you need the DataFrame.iloc function to select columns by position. Because a new column has been inserted, you need position + 1 for the avg column, together with DataFrame.where to replace non-matched values with missing values:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,3,size=(10, 5)), columns=list('ABCDE'))
df.insert(loc=0, column='new', value=(df.iloc[:, 0:]==1).sum(axis=1))
df.insert(loc=1, column='avg', value=(df.iloc[:, 1:].where(df.iloc[:, 1:]!=0)).mean(axis=1))
print (df)
   new       avg  A  B  C  D  E
0    1  1.750000  2  1  2  2  0
1    2  1.600000  2  2  1  2  1
2    2  1.500000  2  1  0  1  2
3    2  1.333333  1  0  2  0  1
4    1  1.500000  2  1  0  0  0
5    1  1.666667  0  1  2  0  2
6    2  1.000000  0  0  1  0  1
7    1  1.500000  0  0  0  2  1
8    2  1.600000  1  2  2  2  1
9    1  1.500000  0  0  2  1  0
Or use a helper DataFrame stored in a df1 variable:
np.random.seed(123)
df = pd.DataFrame(np.random.randint(0,3,size=(10, 5)), columns=list('ABCDE'))
df1 = df.copy()
df.insert(loc=0, column='new', value=(df1==1).sum(axis=1))
df.insert(loc=1, column='avg', value=df1.where(df1!=0).mean(axis=1))
print (df)
   new       avg  A  B  C  D  E
0    1  1.750000  2  1  2  2  0
1    2  1.600000  2  2  1  2  1
2    2  1.500000  2  1  0  1  2
3    2  1.333333  1  0  2  0  1
4    1  1.500000  2  1  0  0  0
5    1  1.666667  0  1  2  0  2
6    2  1.000000  0  0  1  0  1
7    1  1.500000  0  0  0  2  1
8    2  1.600000  1  2  2  2  1
9    1  1.500000  0  0  2  1  0

The issue arises with the expression (df.iloc[:, 1:]!=0).mean(axis=1). Because df.iloc[:, 1:]!=0 is a comparison, it returns a matrix of booleans. Taking the mean of those values does not give the mean of the original values, since the maximum value in such a matrix is 1.
Hence, the following does the job (note the indexing as well):
df = pd.DataFrame(np.random.randint(0,3,size=(10, 10)), columns=list('ABCDEFGHIJ'))
df.insert(loc=0, column='new', value=(df.iloc[:, 0:]==1).sum(axis=1))
df.insert(loc=1, column='avg', value=(df.iloc[:, 1:]!=0).sum(axis=1))  # just keeping the count of non-zeros
df["avg"]=df.iloc[:, 2:].sum(axis=1)/df["avg"]
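To see the pitfall concretely, here is a minimal sketch (a made-up one-row frame; the names are illustrative) contrasting the boolean mean with the masked mean:
import numpy as np
import pandas as pd
row = pd.DataFrame({'A': [2], 'B': [0], 'C': [1]})
# A comparison returns booleans, so its mean is the *fraction* of
# non-zero cells: mean([True, False, True]) = 0.666...
print((row != 0).mean(axis=1))           # 0.666667
# where() keeps the original values and masks zeros with NaN, which
# mean() skips by default: (2 + 1) / 2 = 1.5
print(row.where(row != 0).mean(axis=1))  # 1.5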

Related

Walking average based on two matching columns

I have a dataframe df of the following format:
   team1  team2  score1  score2
0      1      2       1       0
1      3      4       3       0
2      1      3       1       1
3      2      4       0       2
4      1      2       3       2
What I want to do is create a new column that returns the rolling average of the score1 column over the last 3 games, but only across rows where the team1 and team2 pair matches.
Expected output:
   team1  team2  score1  score2  new
0      1      2       1       0    1
1      3      4       3       0    3
2      1      3       1       1    1
3      2      4       0       2    0
4      1      2       3       2    2
I was able to calculate a rolling average over all games for each team separately like this:
df['new'] = df.groupby('team1')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
but cannot find a sensible way to expand that to match two teams.
I tried the code below that returns... something, but definitely not what I need.
df['new'] = df.groupby(['team1','team2'])['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
I suppose this could be done with apply() but I want to avoid it due to performance issues.
Not sure what your exact expected output is, but you can first reshape the DataFrame to long format:
(pd.wide_to_long(df.reset_index(), ['team', 'score'], i='index', j='x')
   .groupby('team')['score']
   .rolling(3, min_periods=1).mean()
)
Output:
team  index  x
1     0      1    1.0
      2      1    1.0
2     3      1    0.0
      0      2    0.0
3     1      1    3.0
      2      2    2.0
4     1      2    0.0
      3      2    1.0
Name: score, dtype: float64
The workaround I've found was to create a 'temp' column that merges the values in 'team1' and 'team2' and to use that column as the grouping key for the rolling average (note the astype(str) casts, since the team columns are integers here):
df['temp'] = df.team1.astype(str) + '_' + df.team2.astype(str)
df['new'] = df.groupby('temp')['score1'].transform(lambda x: x.rolling(3, min_periods=1).mean())
Can this be done in one line?
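As a one-line sketch (not from the original thread): grouping by both key columns directly should be equivalent to the temp-column workaround, assuming each pairing always stores the two teams in the same order. It is the question's second attempt with the missing parenthesis restored:
df['new'] = (df.groupby(['team1', 'team2'])['score1']
               .transform(lambda x: x.rolling(3, min_periods=1).mean()))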

Can't find the min value (which is > 0) in each row of selected columns with df[df[col]>0]

This is my data, and I want to find the min value of the selected columns (a, b, c, d) in each row, then calculate the difference between that and dd. I need to ignore 0 in each row; for example, in the first row I need to find 8.
need to ignore 0 in rows
Then just replace it with NaN; consider the following simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A":[1,2,0],"B":[3,5,7],"C":[7,0,7]})
# replace 0 with NaN so that min() skips it, then take the row-wise min
df["minvalue"] = df.replace(0, np.nan).min(axis=1)
print(df)
gives output
   A  B  C  minvalue
0  1  3  7       1.0
1  2  5  0       2.0
2  0  7  7       7.0
You can use pandas.apply with axis=1: take the columns ['a','b','c','d'] of each row as a Series, replace 0 with +inf, and find the min. At the end, compute the difference between that min and the 'dd' column.
import numpy as np
df['min_dd'] = df.apply(lambda row: row[['a','b','c','d']].replace(0, np.inf).min() - row['dd'], axis=1)
print(df)
   a   b  c  d  dd  min_dd
0  0  15  0  8   6     2.0   # min without zero: 8, dd: 6 -> 8-6=2
1  2   0  5  3   2     0.0   # min without zero: 2, dd: 2 -> 2-2=0
2  5   3  3  0   2     1.0   # 3 - 2
3  0   2  3  4   2     0.0   # 2 - 2
You can try
cols = ['a','b','c','d']
df['res'] = df[cols][df[cols].ne(0)].min(axis=1) - df['dd']
print(df)
   a   b  c  d  dd  res
0  0  15  0  8   6  2.0
1  2   0  5  3   2  0.0
2  5   3  3  0   2  1.0
3  2   3  4  4   2  0.0
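For completeness, a replace-based one-liner in the same spirit (a sketch combining the two answers above, assuming the same column names):
import numpy as np
cols = ['a', 'b', 'c', 'd']
# zeros become NaN, which min() skips by default
df['res'] = df[cols].replace(0, np.nan).min(axis=1) - df['dd']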

How could I replace a null value in a group?

I created this dataframe and calculated the gap I was looking for, but the problem is that some flats have the same price, so I get a price difference of 0. How could I replace the 0 value with the difference from the next lower price in the same group?
for example:
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 5
neighborhood: a, bed: 1, bath: 1, price: 3
neighborhood: a, bed: 1, bath: 1, price: 2
I get price differences of 0, 2, 1, NaN, but I'm looking for 2, 2, 1, NaN (in short, I don't want to compare two flats with the same price).
Thanks in advance and good day.
data=[
[1,'a',1,1,5],[2,'a',1,1,5],[3,'a',1,1,4],[4,'a',1,1,2],[5,'b',1,2,6],[6,'b',1,2,6],[7,'b',1,2,3]
]
df = pd.DataFrame(data, columns = ['id','neighborhoodname', 'beds', 'baths', 'price'])
df['difference_price'] = (df.dropna()
                            .sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price'].diff(-1))
I think you need to first remove duplicates across all columns used for the groupby with diff, create the new column in that filtered data, and finally use merge with a left join back to the original:
df1 = (df.dropna()
         .sort_values('price', ascending=False)
         .drop_duplicates(['neighborhoodname','beds','baths','price']))
df1['difference_price'] = df1.groupby(['neighborhoodname','beds','baths'])['price'].diff(-1)
df = df.merge(df1[['neighborhoodname','beds','baths','price', 'difference_price']], how='left')
print (df)
   id neighborhoodname  beds  baths  price  difference_price
0   1                a     1      1      5               1.0
1   2                a     1      1      5               1.0
2   3                a     1      1      4               2.0
3   4                a     1      1      2               NaN
4   5                b     1      2      6               3.0
5   6                b     1      2      6               3.0
6   7                b     1      2      3               NaN
Or you can use a lambda function that back-fills 0 values per group, which avoids wrong output for one-row groups (data moved in from other groups):
df['difference_price'] = (df.sort_values('price', ascending=False)
                            .groupby(['neighborhoodname','beds','baths'])['price']
                            .apply(lambda x: x.diff(-1).replace(0, np.nan).bfill()))
print (df)
   id neighborhoodname  beds  baths  price  difference_price
0   1                a     1      1      5               1.0
1   2                a     1      1      5               1.0
2   3                a     1      1      4               2.0
3   4                a     1      1      2               NaN
4   5                b     1      2      6               3.0
5   6                b     1      2      6               3.0
6   7                b     1      2      3               NaN

How to fuse a small pandas.dataframe into a larger one based on values of a column?

I have two pandas.dataframe df1 and df2:
>>>import pandas as pd
>>>import numpy as np
>>>from random import random
>>>df1=pd.DataFrame({'x1':range(10), 'y1':np.repeat(0,10).tolist()})
>>>df2=pd.DataFrame({'x2':range(0,10,2), 'y2':[random() for _ in range(5)]})
>>>df1
   x1  y1
0   0   0
1   1   0
2   2   0
3   3   0
4   4   0
5   5   0
6   6   0
7   7   0
8   8   0
9   9   0
>>>df2
   x2        y2
0   0  0.075922
1   2  0.606703
2   4  0.272918
3   6  0.842641
4   8  0.576636
Now I want to fuse df2 into df1. That is to say, I want to change the values of y1 in df1 to the values of y2 in df2 wherever the value of x1 in df1 equals the value of x2 in df2. The final result I need looks like the following:
>>>df1
   x1        y1
0   0  0.075922
1   1         0
2   2  0.606703
3   3         0
4   4  0.272918
5   5         0
6   6  0.842641
7   7         0
8   8  0.576636
9   9         0
Although I can use the following code to get the above result:
>>> for i in range(df1.shape[0]):
...     for j in range(df2.shape[0]):
...         if df1.iloc[i,0] == df2.iloc[j,0]:
...             df1.iloc[i,1] = df2.iloc[j,1]
...
I think there must be better ways to achieve this. Do you know what they are? Thank you in advance.
You can use df.update to update your df1 in place, e.g.:
df1.update({'y1': df2.set_index('x2')['y2']})
Gives you:
   x1        y1
0   0  0.075922
1   1  0.000000
2   2  0.606703
3   3  0.000000
4   4  0.272918
5   5  0.000000
6   6  0.842641
7   7  0.000000
8   8  0.576636
9   9  0.000000
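One caveat worth noting: update aligns on index labels, so this one-liner only works because df1's default RangeIndex happens to coincide with the x1 values; if it did not, you would set x1 as the index first, as a later answer below does.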
Use map and then replace missing values with the original values via fillna:
df1['y1'] = df1['x1'].map(df2.set_index('x2')['y2']).fillna(df1['y1'])
print (df1)
   x1        y1
0   0  0.696469
1   1  0.000000
2   2  0.286139
3   3  0.000000
4   4  0.226851
5   5  0.000000
6   6  0.551315
7   7  0.000000
8   8  0.719469
9   9  0.000000
You can also use update after setting indices of both dataframes:
import pandas as pd
import numpy as np
from random import random
df1=pd.DataFrame({'x1':range(10), 'y1':np.repeat(0,10).tolist()})
#set index of the first dataframe to be 'x1'
df1.set_index('x1', inplace=True)
df2=pd.DataFrame({'x2':range(0,10,2), 'y1':[random() for _ in range(5)]})
#set index of the second dataframe to be 'x2'
df2.set_index('x2', inplace=True)
#update values in df1 with values in df2
df1.update(df2)
#reset index if necessary (though index will look exactly like x1 column)
df1 = df1.reset_index()
update() seems to be the best option here!
import pandas as pd
import numpy as np
from random import random
# your dataframes
df1 = pd.DataFrame({'x1': range(10), 'y1': np.repeat(0, 10).tolist()})
df2 = pd.DataFrame({'x2': range(0, 10, 2), 'y2': [random() for _ in range(5)]})
# printing df1 and df2 values before update
print(df1)
print(df2)
df1.update({'y1': df2.set_index('x2')['y2']})
# printing df1 after update was performed
print(df1)
Another method, adding the two dataframes together:
# first give df2 the same column names as df1
df2.columns = ['x1','y1']
#now set 'x1' as the index for both dfs (since this is what you want to 'join' on)
df1 = df1.set_index('x1')
df2 = df2.set_index('x1')
print(df1)
    y1
x1
0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    0
8    0
9    0
print(df2)
          y1
x1
0   0.525232
2   0.907628
4   0.612100
6   0.497420
8   0.656509
#now you can simply add the two dfs to each other
df_new = df1 + df2
print(df_new)
          y1
x1
0   0.317418
1        NaN
2   0.581443
3        NaN
4   0.728766
5        NaN
6   0.495450
7        NaN
8   0.171131
9        NaN
Two problems:
1. The dataframe has NAs where you want 0s. These are the positions where df2 was not defined; they were effectively NA in df2, and NA + anything = NA. This can be fixed with a fillna.
2. You want 'x1' to be a column, not the index, so just reset the index:
df_new=df_new.reset_index().fillna(0)
print(df_new)
   x1        y1
0   0  0.118903
1   1  0.000000
2   2  0.465557
3   3  0.000000
4   4  0.533266
5   5  0.000000
6   6  0.518484
7   7  0.000000
8   8  0.308733
9   9  0.000000
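Note that the addition trick only behaves like a replacement here because y1 in df1 is all zeros; if df1 held non-zero values, df1 + df2 would sum them with df2's values instead of overwriting them, so update or map remain the safer general-purpose tools.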

computing daily return/increment on dataframe

So I've got some timeseries data on which I want to compute the daily return/increment, where daily increment = value_at_time(T) / value_at_time(T-1).
import pandas as pd
df=pd.DataFrame([1,2,3,7]) #Sample data frame
df[1:]
out:
   0
1  2
2  3
3  7
df[:-1]
out:
   0
0  1
1  2
2  3
######### Method 1
df[1:]/df[:-1]
out:
     0
0  NaN
1  1.0
2  1.0
3  NaN
######### Method 2
df[1:]/df[:-1].values
out:
          0
1  2.000000
2  1.500000
3  2.333333
######### Method 3
df[1:].values/df[:-1]
out:
   0
0  2
1  1
2  2
My questions are:
1. If df[:-1] and df[1:] each hold only three values (row slices of the dataframe), why doesn't Method 1 work?
2. Why do Methods 2 and 3, which look almost identical, give different results?
3. Why does using .values in Method 2 make it work?
Let's look at each method.
Method 1: if you look at what the slices return, you can see that the indices don't align:
In [87]:
print(df[1:])
print(df[:-1])
   0
1  2
2  3
3  7
   0
0  1
1  2
2  3
so when you do the division, only the two overlapping index labels (1 and 2) produce a result:
In [88]:
df[1:]/df[:-1]
Out[88]:
     0
0  NaN
1  1.0
2  1.0
3  NaN
Method 2 produces a NumPy array; this has no index, so the division is performed element-wise in order, as expected:
In [89]:
df[:-1].values
Out[89]:
array([[1],
       [2],
       [3]], dtype=int64)
Giving:
In [90]:
df[1:]/df[:-1].values
Out[90]:
          0
1  2.000000
2  1.500000
3  2.333333
Method 3 works for the same reason as Method 2, except that the result takes its index from df[:-1]; the truncated values shown (2, 1, 2) also suggest this output came from integer (floor) division, e.g. under Python 2.
So the question is: how do we do this in pure pandas? We use shift to align the indices as desired:
In [92]:
df.shift(-1)/df
Out[92]:
          0
0  2.000000
1  1.500000
2  2.333333
3       NaN
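A footnote (not from the original answer): the shift(-1) version labels each ratio at time T-1. If you want the increment labelled at time T, matching the question's definition value_at_time(T) / value_at_time(T-1), shift the other way; pct_change gives the same quantity minus 1:
import pandas as pd
df = pd.DataFrame([1, 2, 3, 7])
# value(T) / value(T-1), aligned with row T
print(df / df.shift(1))
#           0
# 0       NaN
# 1  2.000000
# 2  1.500000
# 3  2.333333
# equivalent: pct_change() is (value(T) - value(T-1)) / value(T-1)
print(df.pct_change() + 1)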
