I have a formula that I want to turn into a pandas calculation. The formula itself is very simple:

NEW = A(where value=1) + A(where value=3) + A(where value=5)
I have a data frame like this:

Type  subType  value  A  NEW
X     a        1      3  =3+9+9=21
X     a        3      9
X     a        5      9
X     b        1      4  =4+5+0=9
X     b        3      5
X     b        5      0
Y     a        1      1  =1+2+3=6
Y     a        3      2
Y     a        5      3
Y     b        1      4  =4+5+2=11
Y     b        3      5
Y     b        5      2
Two questions:
1. I know I could just write out the calculation with references to specific cells, but I want the code to look cleaner. Is there another way to get these values?
2. Since there will be only two results (one for X and one for Y), how can I add them to my original dataframe for further calculation? (My own thought is not to add them to the dataframe at all and just use the values whenever they are needed later.)
I'm quite new to coding, so any answer is appreciated!
Try this:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','Y','Y','Y'], 'value':[1,3,5,1,3,5], 'A':[3,9,4,0,2,2]})
>>> df
Type value A
0 X 1 3
1 X 3 9
2 X 5 4
3 Y 1 0
4 Y 3 2
5 Y 5 2
>>> df.groupby('Type')['A'].sum()
Type
X    16
Y     4
Name: A, dtype: int64
>>> ur_dict = df.groupby('Type')['A'].sum().to_dict()
>>> df['NEW'] = df['Type'].map(ur_dict)
>>> df
Type value A NEW
0 X 1 3 16
1 X 3 9 16
2 X 5 4 16
3 Y 1 0 4
4 Y 3 2 4
5 Y 5 2 4
Hope this helps.
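As a side note, a sketch of a shorter variant (not part of the original answer): groupby(...).transform('sum') broadcasts the group sums back onto the original rows directly, skipping the dict/map round-trip:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['X', 'X', 'X', 'Y', 'Y', 'Y'],
                   'value': [1, 3, 5, 1, 3, 5],
                   'A': [3, 9, 4, 0, 2, 2]})

# transform('sum') returns a Series aligned with df's index,
# so every row receives the sum of its own Type group
df['NEW'] = df.groupby('Type')['A'].transform('sum')
```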
Edit to answer additional inquiry:
You are mapping tuple keys onto a Series, which will give you an error. You should first shift the columns you need into the index, and then map the dictionary onto that index.
See below:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','X','X','X','Y','Y','Y','Y','Y','Y'], 'subType':['a','a','a','b','b','b','a','a','a','b','b','b'],'value':[1,3,5,1,3,5,1,3,5,1,3,5],'A':[3,9,9,4,5,0,1,2,3,4,5,2]})
>>> df
Type subType value A
0 X a 1 3
1 X a 3 9
2 X a 5 9
3 X b 1 4
4 X b 3 5
5 X b 5 0
6 Y a 1 1
7 Y a 3 2
8 Y a 5 3
9 Y b 1 4
10 Y b 3 5
11 Y b 5 2
>>> df.groupby(['Type', 'subType'])['A'].sum()
Type subType
X a 21
b 9
Y a 6
b 11
Name: A, dtype: int64
>>> ur_dict = df.groupby(['Type', 'subType'])['A'].sum().to_dict()
>>> ur_dict
{('X', 'a'): 21, ('X', 'b'): 9, ('Y', 'a'): 6, ('Y', 'b'): 11}
>>> df['NEW'] = df.set_index(['Type', 'subType']).index.map(ur_dict)
>>> df
Type subType value A NEW
0 X a 1 3 21
1 X a 3 9 21
2 X a 5 9 21
3 X b 1 4 9
4 X b 3 5 9
5 X b 5 0 9
6 Y a 1 1 6
7 Y a 3 2 6
8 Y a 5 3 6
9 Y b 1 4 11
10 Y b 3 5 11
11 Y b 5 2 11
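An alternative to the tuple-keyed dict (a sketch, using the same data as above): aggregate first, then merge the sums back onto the original rows on both key columns:

```python
import pandas as pd

df = pd.DataFrame({'Type': ['X'] * 6 + ['Y'] * 6,
                   'subType': ['a', 'a', 'a', 'b', 'b', 'b'] * 2,
                   'value': [1, 3, 5] * 4,
                   'A': [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2]})

# aggregate per (Type, subType), then merge the sums back onto every row
sums = (df.groupby(['Type', 'subType'], as_index=False)['A']
          .sum()
          .rename(columns={'A': 'NEW'}))
df = df.merge(sums, on=['Type', 'subType'])
```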
I have a 1000 by 6 dataframe in which columns A, B, C, and D were rated by people on a scale of 1-10.
The SELECT column holds a value that, in every row, matches the value in one of A/B/C/D.
I want to change the value in SELECT to the name of the column it matches. For example, for ID 1, SELECT = 1 and D = 1, so the value of SELECT should change to D.
import pandas as pd
df = pd.read_excel("u.xlsx",sheet_name = "Sheet2",header = 0)
But I am lost as to how to proceed.
Gwenersl's solution compares all columns except ID and SELECT (selected with columns.difference) against SELECT using DataFrame.eq (==), takes the first True per row with idxmax, and falls back to 'no match' via numpy.where when no column matches:
import numpy as np

cols = df.columns.difference(['ID','SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
print (df)
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Detail:
print (mask)
A B C D
0 False False False True
1 False False True False
2 False False True False
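Putting it together as a runnable sketch (the small sample frame below is reconstructed from the output above; the real data would come from the Excel file):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'A': [4, 5, 7], 'B': [9, 7, 4],
                   'C': [7, 2, 8], 'D': [1, 8, 6],
                   'SELECT': [1, 2, 8]})

cols = df.columns.difference(['ID', 'SELECT'])   # the rating columns A..D
mask = df[cols].eq(df['SELECT'], axis=0)         # row-wise comparison against SELECT
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
```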
Assuming the values in A, B, C, D are unique in each row with respect to SELECT, I'd do it like this:
>>> df
ID A B C D SELECT
0 1 4 9 7 1 1
1 2 5 7 2 8 2
2 3 7 4 8 6 8
>>>
>>> df_abcd = df.loc[:, 'A':'D']
>>> df['SELECT'] = df_abcd.apply(lambda row: row.isin(df['SELECT']).idxmax(), axis=1)
>>> df
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Use -
df['SELECT2'] = df.columns[
    pd.DataFrame([df['SELECT'] == df['A'],
                  df['SELECT'] == df['B'],
                  df['SELECT'] == df['C'],
                  df['SELECT'] == df['D']]).transpose().idxmax(1) + 1
]
Output
ID A B C D SELECT SELECT2
0 1 4 9 7 1 1 D
1 2 5 7 2 8 2 C
2 3 7 4 8 6 8 C
Dataframe df:

 A   B   C   D  E
 1   2   4   6  # value to be updated for this column
12  34   5  54
 4   8  12   4
 3   5   6   2
 5   7  11  27

numpy ndarray (shape 4 x 1):

npar = [[12]
        [ 6]
        [ 2]
        [27]]
I have the above dataframe df and array npar, and I want to compare the values of column D against npar: if a row's D value is found anywhere in npar, I want to set column E to 1 for that row, else 0. Kindly suggest how I can do this with sample code.
You need isin, but first it is necessary to flatten the array with numpy.ravel, and finally to convert the boolean mask to integers (True becomes 1 and False becomes 0):
df['E'] = df.D.isin(npar.ravel()).astype(int)
print (df)
A B C D E
0 1 2 4 6 1
1 12 34 5 54 0
2 4 8 12 4 0
3 3 5 6 2 1
4 5 7 11 27 1
Detail:
npar = np.array([[12],[6],[2],[27]])
print (npar)
[[12]
[ 6]
[ 2]
[27]]
print (npar.ravel())
[12 6 2 27]
print (df.D.isin(npar.ravel()))
0 True
1 False
2 False
3 True
4 True
Name: D, dtype: bool
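As a sketch of an alternative, numpy.isin accepts the 2-D array directly (it flattens its second argument internally), so the explicit ravel step can be skipped:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 12, 4, 3, 5],
                   'B': [2, 34, 8, 5, 7],
                   'C': [4, 5, 12, 6, 11],
                   'D': [6, 54, 4, 2, 27]})
npar = np.array([[12], [6], [2], [27]])

# np.isin flattens npar itself and returns a boolean array shaped like df.D
df['E'] = np.isin(df['D'], npar).astype(int)
```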
I want to delete duplicate adjacent rows in a dataframe. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True), but shift() is not behaving the way I intended.
Look at the following example
In [11]: df
Out[11]:
x y
0 a 1
1 b 2
2 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
df.x[3] equals df.x[4], but the y values are different. Yet the output is the following:
In [13]: df[df.shift() != df]
Out[13]:
x y
0 a 1
1 b 2
2 NaN NaN
3 e 4
4 NaN 5
5 f 6
6 g 7
7 h 8
I want to delete a row only if the entire row is a duplicate of the previous one, not if it merely shares some values. Any idea?
Well, look at df.shift() != df:
>>> df.shift() != df
x y
0 True True
1 True True
2 False False
3 True True
4 False True
5 True True
6 True True
7 True True
This is a 2D object, not 1D, so when you use it as a filter on a frame you keep the cells that are True and get NaN in the cells that are False. It sounds like you want to keep the rows where any value is True -- which is a 1D object:
>>> (df.shift() != df).any(axis=1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
x y
0 a 1
1 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
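For completeness, a runnable sketch of the whole thing (data reconstructed from the example above), with reset_index added to renumber the surviving rows as the question intended:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'b', 'e', 'e', 'f', 'g', 'h'],
                   'y': [1, 2, 2, 4, 5, 6, 7, 8]})

# keep a row only when it differs from the previous row in at least one column
keep = (df != df.shift()).any(axis=1)
dedup = df[keep].reset_index(drop=True)
```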
I have a DataFrame x with three columns;
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values;
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns;
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (note that the expected results listed in your question don't quite line up with that condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
then write a function to be used by pandas.apply() later (using .iloc and boolean indexing rather than the long-deprecated .ix):

In[101]: def cond_sum(x):
             t = x.iloc[0]
             return df1.loc[(df1['a'] < t) & (df1['b'] > t), 'c'].sum()
finally:
In[102]: df3 = df2.apply(cond_sum,axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
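The apply can also be replaced with a vectorized sketch using NumPy broadcasting: build a (len(df2) x len(df1)) boolean mask of which rows of df1 satisfy the condition for each t, then matrix-multiply it with column c to get the sums in one shot:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

t = df2['t'].to_numpy()[:, None]                              # shape (2, 1)
mask = (df1['a'].to_numpy() < t) & (df1['b'].to_numpy() > t)  # shape (2, 5)
# mask @ c sums c over the True positions of each row of mask
df3 = pd.Series(mask @ df1['c'].to_numpy(), index=df2.index)
```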