Combining two datasets to form a boolean column (pandas) - python

I have two DataFrames in pandas:
dfm_one
data group_a group_b
0 3 a z
1 1 a z
2 2 b x
3 0 b x
4 0 b x
5 1 b z
6 0 c x
7 0 c y
8 3 c z
9 3 c z
dfm_two
data group_a group_b
0 4 a x
1 4 a y
2 4 b x
3 4 b x
4 4 b y
5 1 b y
6 1 b z
7 1 c x
8 4 c y
9 3 c z
10 2 c z
As output I want a boolean column that indicates, for each row of dfm_one, whether there is a matching data entry (i.e. one with the same value) in dfm_two for the same group_a/group_b combination.
So my expected output is:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
I'm guessing the code should look something like:
dfm_one.groupby(['group_a','group_b']).apply(lambda x: ??)
and that the function inside apply should make use of the isin method.
Another solution might be to merge the two datasets but I think this is not trivial since there is no unique identifier in the DataFrame.

OK, this is a slight hack: if we cast the df to str dtype, we can call sum to concatenate each row into a single string. The resultant string acts as a kind of unique identifier, so we can then call isin against the other df, converted to str the same way:
In [91]:
dfm_one.astype(str).sum(axis=1).isin(dfm_two.astype(str).sum(axis=1))
Out[91]:
0 False
1 False
2 False
3 False
4 False
5 True
6 False
7 False
8 True
9 True
dtype: bool
Output from the conversions:
In [92]:
dfm_one.astype(str).sum(axis=1)
Out[92]:
0 3az
1 1az
2 2bx
3 0bx
4 0bx
5 1bz
6 0cx
7 0cy
8 3cz
9 3cz
dtype: object
In [93]:
dfm_two.astype(str).sum(axis=1)
Out[93]:
0 4ax
1 4ay
2 4bx
3 4bx
4 4by
5 1by
6 1bz
7 1cx
8 4cy
9 3cz
10 2cz
dtype: object
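For comparison, the same result can be had without the string hack by merging against the deduplicated second frame with indicator=True and checking which rows matched; a left merge preserves dfm_one's row order. A minimal sketch reconstructing the sample frames from the question:

```python
import pandas as pd

dfm_one = pd.DataFrame({
    'data':    [3, 1, 2, 0, 0, 1, 0, 0, 3, 3],
    'group_a': list('aabbbbcccc'),
    'group_b': list('zzxxxzxyzz'),
})
dfm_two = pd.DataFrame({
    'data':    [4, 4, 4, 4, 4, 1, 1, 1, 4, 3, 2],
    'group_a': list('aabbbbbcccc'),
    'group_b': list('xyxxyyzxyzz'),
})

# Left-merge against the deduplicated right frame; the _merge indicator
# column records which rows of dfm_one found a match in dfm_two.
merged = dfm_one.merge(dfm_two.drop_duplicates(),
                       on=['data', 'group_a', 'group_b'],
                       how='left', indicator=True)
result = merged['_merge'].eq('both')
```

Deduplicating dfm_two first matters: without it, duplicate right-hand rows like (4, b, x) would multiply matching left-hand rows in the merge result.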

Related

Question about conditional calculation in pandas

I have this formula, I wanted to turn this into pandas calculation,
the formula is very easy:
NEW = A(where v=1) + A(where v=3) + A(where v=5)
I have a data frame like this:
Type subType value A NEW
X a 1 3 =3+9+9=21
X a 3 9
X a 5 9
X b 1 4 =4+5+0=9
X b 3 5
X b 5 0
Y a 1 1 =1+2+3=6
Y a 3 2
Y a 5 3
Y b 1 4 =4+5+2=11
Y b 3 5
Y b 5 2
Two questions:
I know I can just write down the calculation with the specified cells, but I want the code to look nicer. Is there another way to get the value?
Because there will be only two results (one for X and one for Y), how can I add them to my original dataframe for further calculation? (My thought is not to add them to the dataframe and just use the values whenever they're needed in future calculations.)
Quite new to coding, any answer will be appreciated!
Try this:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','Y','Y','Y'], 'value':[1,3,5,1,3,5], 'A':[3,9,4,0,2,2]})
>>> df
Type value A
0 X 1 3
1 X 3 9
2 X 5 4
3 Y 1 0
4 Y 3 2
5 Y 5 2
>>> df.groupby('Type')['A'].sum()
Type
X 16
Y 4
>>> ur_dict = df.groupby('Type')['A'].sum().to_dict()
>>> df['NEW'] = df['Type'].map(ur_dict)
>>> df
Type value A NEW
0 X 1 3 16
1 X 3 9 16
2 X 5 4 16
3 Y 1 0 4
4 Y 3 2 4
5 Y 5 2 4
Hope this helps.
Edit to answer additional inquiry:
You are mapping tuple keys onto a Series, which will give you an error. You should move the columns you need into the index before doing the mapping.
See below:
>>> import pandas as pd
>>> df = pd.DataFrame({'Type':['X','X','X','X','X','X','Y','Y','Y','Y','Y','Y'], 'subType':['a','a','a','b','b','b','a','a','a','b','b','b'],'value':[1,3,5,1,3,5,1,3,5,1,3,5],'A':[3,9,9,4,5,0,1,2,3,4,5,2]})
>>> df
Type subType value A
0 X a 1 3
1 X a 3 9
2 X a 5 9
3 X b 1 4
4 X b 3 5
5 X b 5 0
6 Y a 1 1
7 Y a 3 2
8 Y a 5 3
9 Y b 1 4
10 Y b 3 5
11 Y b 5 2
>>> df.groupby(['Type', 'subType'])['A'].sum()
Type subType
X a 21
b 9
Y a 6
b 11
Name: A, dtype: int64
>>> ur_dict = df.groupby(['Type', 'subType'])['A'].sum().to_dict()
>>> ur_dict
{('X', 'a'): 21, ('X', 'b'): 9, ('Y', 'a'): 6, ('Y', 'b'): 11}
>>> df['NEW'] = df.set_index(['Type', 'subType']).index.map(ur_dict)
>>> df
Type subType value A NEW
0 X a 1 3 21
1 X a 3 9 21
2 X a 5 9 21
3 X b 1 4 9
4 X b 3 5 9
5 X b 5 0 9
6 Y a 1 1 6
7 Y a 3 2 6
8 Y a 5 3 6
9 Y b 1 4 11
10 Y b 3 5 11
11 Y b 5 2 11
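The dictionary round-trip can also be skipped entirely: groupby(...).transform('sum') broadcasts each group's sum straight back onto that group's rows. A sketch using the same data as above:

```python
import pandas as pd

df = pd.DataFrame({
    'Type':    ['X'] * 6 + ['Y'] * 6,
    'subType': ['a', 'a', 'a', 'b', 'b', 'b'] * 2,
    'value':   [1, 3, 5] * 4,
    'A':       [3, 9, 9, 4, 5, 0, 1, 2, 3, 4, 5, 2],
})

# transform('sum') returns a Series aligned with df's index, with each
# group's total repeated on every row of that group.
df['NEW'] = df.groupby(['Type', 'subType'])['A'].transform('sum')
```

This gives the same NEW column (21 for X/a, 9 for X/b, 6 for Y/a, 11 for Y/b) in one line.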

Match Value and Get its Column Header in python

I have a 1000-by-6 dataframe, where A, B, C, D were rated by people on a scale of 1-10.
In the SELECT column, I have a value which in each row is the same as the value in one of A/B/C/D.
I want to change the value in SELECT to the name of the column it matches. For example, for ID 1, SELECT = 1 and D = 1, so the value of SELECT should change to D.
import pandas as pd
df = pd.read_excel("u.xlsx",sheet_name = "Sheet2",header = 0)
But I am lost how to proceed.
This solution compares all columns except ID and SELECT (selected with columns.difference) against SELECT using DataFrame.eq (==), finds the first True value per row with idxmax, and falls back to 'no match' via numpy.where when no value matches:
import numpy as np

cols = df.columns.difference(['ID','SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
print (df)
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Detail:
print (mask)
A B C D
0 False False False True
1 False False True False
2 False False True False
Assuming the values in A, B, C, D are unique in each row with respect to SELECT, I'd do it like this:
>>> df
ID A B C D SELECT
0 1 4 9 7 1 1
1 2 5 7 2 8 2
2 3 7 4 8 6 8
>>>
>>> df_abcd = df.loc[:, 'A':'D']
>>> df['SELECT'] = df_abcd.apply(lambda row: row.isin(df['SELECT']).idxmax(), axis=1)
>>> df
ID A B C D SELECT
0 1 4 9 7 1 D
1 2 5 7 2 8 C
2 3 7 4 8 6 C
Use -
df['SELECT2'] = df.columns[pd.DataFrame([df['SELECT'] == df['A'], df['SELECT'] == df['B'], df['SELECT'] == df['C'], df['SELECT'] == df['D']]).transpose().idxmax(1)+1]
Output
ID A B C D SELECT SELECT2
0 1 4 9 7 1 1 D
1 2 5 7 2 8 2 C
2 3 7 4 8 6 8 C
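For reference, the first approach runs end to end like this, reconstructing the sample frame from the outputs above (the eq/idxmax/np.where combination is exactly the answer's; only the inline data setup is added):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'A': [4, 5, 7],
                   'B': [9, 7, 4],
                   'C': [7, 2, 8],
                   'D': [1, 8, 6],
                   'SELECT': [1, 2, 8]})

# Compare each candidate column to SELECT row-wise, then take the first
# matching column label; rows with no match get the string 'no match'.
cols = df.columns.difference(['ID', 'SELECT'])
mask = df[cols].eq(df['SELECT'], axis=0)
df['SELECT'] = np.where(mask.any(axis=1), mask.idxmax(axis=1), 'no match')
```

Because mask.idxmax(axis=1) returns the label of the first True per row, ties (a value appearing in two columns) resolve to the leftmost matching column.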

Compare Dataframe column with numpy ndarray and update value in dataframe

Dataframe df:
A B C D E
1 2 4 6 # Value to be updated for this column
12 34 5 54
4 8 12 4
3 5 6 2
5 7 11 27
numpy ndarray (shape (4, 1)):
npar = [[12]
        [ 6]
        [ 2]
        [27]]
I have the above dataframe df and array npar. I want to compare the values of column D against the array npar: if a row's value in column D is found anywhere in npar, I want to set column E to 1 for that row, else 0. Kindly suggest how I can do this with sample code.
You need isin, but it is first necessary to flatten the array with numpy.ravel, and finally to convert the boolean mask to integers (Trues become 1s and Falses 0s):
df['E'] = df.D.isin(npar.ravel()).astype(int)
print (df)
A B C D E
0 1 2 4 6 1
1 12 34 5 54 0
2 4 8 12 4 0
3 3 5 6 2 1
4 5 7 11 27 1
Detail:
npar = np.array([[12],[6],[2],[27]])
print (npar)
[[12]
[ 6]
[ 2]
[27]]
print (npar.ravel())
[12 6 2 27]
print (df.D.isin(npar.ravel()))
0 True
1 False
2 False
3 True
4 True
Name: D, dtype: bool
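An alternative worth noting: numpy.isin flattens its second argument itself, so the explicit ravel can be dropped. A self-contained sketch with the sample data reconstructed from the output above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 12, 4, 3, 5],
                   'B': [2, 34, 8, 5, 7],
                   'C': [4, 5, 12, 6, 11],
                   'D': [6, 54, 4, 2, 27]})
npar = np.array([[12], [6], [2], [27]])

# np.isin treats npar as a flat set of test values regardless of shape,
# returning a boolean array aligned with df['D'].
df['E'] = np.isin(df['D'], npar).astype(int)
```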

How does pandas.shift really work?

I want to delete duplicate adjacent rows in a dataframe. I was trying to do this with df[df.shift() != df].dropna().reset_index(drop=True) but shift() is not behaving in the way I meant.
Look at the following example
In [11]: df
Out[11]:
x y
0 a 1
1 b 2
2 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
df.x[3] equals df.x[4], but the y values are different. Yet the output is the following:
In [13]: df[df.shift() != df]
Out[13]:
x y
0 a 1
1 b 2
2 NaN NaN
3 e 4
4 NaN 5
5 f 6
6 g 7
7 h 8
I want to delete the row if they are really duplicates, not if they contain some duplicate values. Any idea?
Well, look at df.shift() != df:
>>> df.shift() != df
x y
0 True True
1 True True
2 False False
3 True True
4 False True
5 True True
6 True True
7 True True
This is a 2D object, not 1D, so when you use it as a filter on a frame you keep the ones where you have True and get NaN with the ones where you have False. It sounds like you want to keep the ones where either are True -- where any are True -- which is a 1D object:
>>> (df.shift() != df).any(axis=1)
0 True
1 True
2 False
3 True
4 True
5 True
6 True
7 True
dtype: bool
>>> df[(df.shift() != df).any(axis=1)]
x y
0 a 1
1 b 2
3 e 4
4 e 5
5 f 6
6 g 7
7 h 8
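Putting it together, the original dropna/reset_index plan becomes a single boolean filter; a minimal sketch using the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'x': list('abbeefgh'),
                   'y': [1, 2, 2, 4, 5, 6, 7, 8]})

# A row is an adjacent duplicate only when EVERY column equals the
# previous row, so keep rows where ANY column differs from its shift.
deduped = df[(df != df.shift()).any(axis=1)].reset_index(drop=True)
```

Only row 2 (the exact repeat of 'b', 2) is dropped; row 4 survives because its y value differs even though x repeats.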

Summing over a DataFrame with two conditions and multiple values

I have a DataFrame x with three columns:
a b c
1 1 10 4
2 5 6 5
3 4 6 5
4 2 11 9
5 1 2 10
... and a Series y of two values:
t
1 3
2 7
Now I'd like to get a DataFrame z with two columns:
t sum_c
1 3 18
2 7 13
... with t from y and sum_c the sum of c from x for all rows where t was larger than a and smaller than b.
Would anybody be able to help me with this?
Here is a possible solution based on the given condition (note: the expected results listed in your question don't quite line up with the given condition):
In[99]: df1
Out[99]:
a b c
0 1 10 4
1 5 6 5
2 4 6 5
3 2 11 9
4 1 2 10
In[100]: df2
Out[100]:
t
0 3
1 5
then write a function which would be used by pandas.apply() later:
In[101]: def cond_sum(x):
             t = x.iloc[0]
             return df1.loc[(df1['a'] < t) & (df1['b'] > t), 'c'].sum()
finally:
In[102]: df3 = df2.apply(cond_sum,axis=1)
In[103]: df3
Out[103]:
0 13
1 18
dtype: int64
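The row-by-row apply can also be replaced with numpy broadcasting: build a 2D mask of which (a, b) intervals contain each t, then a matrix product collects the matching c values in one shot. A sketch using the answer's data (t values 3 and 5):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'a': [1, 5, 4, 2, 1],
                    'b': [10, 6, 6, 11, 2],
                    'c': [4, 5, 5, 9, 10]})
df2 = pd.DataFrame({'t': [3, 5]})

# Broadcast each t against every (a, b) interval: mask[i, j] is True
# when a_j < t_i < b_j; the matrix product then sums the matching c's.
t = df2['t'].to_numpy()[:, None]
mask = (df1['a'].to_numpy() < t) & (df1['b'].to_numpy() > t)
df3 = df2.assign(sum_c=mask.astype(int) @ df1['c'].to_numpy())
```

This gives sum_c of 13 for t=3 and 18 for t=5, matching the apply-based result above, and avoids a Python-level loop over df2.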
