Create a unique indicator to join two datasets in pandas/python - python

How can I combine four columns in a dataframe in pandas/python to create a unique indicator and do a left join?
Is this even the best way to do what I am trying to accomplish?
example: make a unique indicator (col5)
then set up a join with another dataframe using the same logic
col1 col2 col3 col4 col5
apple pear mango tea applepearmangotea
then do a join something like
pd.merge(df1, df2, how='left', on='col5')

This problem is the same whether it's 4 columns or 2. You don't need to create a unique combined key. You just need to merge on multiple columns.
Consider the two dataframes d1 and d2. They share two columns in common.
d1 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [1, 1, 'g', 'h']
], columns=list('ABCD'))
d2 = pd.DataFrame([
    [0, 0, 'a', 'b'],
    [0, 1, 'c', 'd'],
    [1, 0, 'e', 'f'],
    [2, 0, 'g', 'h']
], columns=list('ABEF'))
d1
A B C D
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 1 1 g h
d2
A B E F
0 0 0 a b
1 0 1 c d
2 1 0 e f
3 2 0 g h
We can perform the equivalent of a left join using pd.DataFrame.merge
d1.merge(d2, 'left')
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
We can be explicit with the columns
d1.merge(d2, 'left', on=['A', 'B'])
A B C D E F
0 0 0 a b a b
1 0 1 c d c d
2 1 0 e f e f
3 1 1 g h NaN NaN
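For completeness, the combined-key approach from the question can be built, but it assumes every key column is a string, and a separator is needed to avoid collisions such as 'ab' + 'c' versus 'a' + 'bc'; merging on the columns directly, as above, is the cleaner pattern. A sketch, assuming df1 and df2 both carry col1 through col4 as string columns:
df1['col5'] = df1['col1'] + '|' + df1['col2'] + '|' + df1['col3'] + '|' + df1['col4']  # build the combined key with a separator
df2['col5'] = df2['col1'] + '|' + df2['col2'] + '|' + df2['col3'] + '|' + df2['col4']
merged = pd.merge(df1, df2, how='left', on='col5')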

Related

Splitting comma separated values into rows - Data Cleaning

How can I split a column into rows if the values are separated by commas? I am stuck here. I have used the following code:
xd = df.assign(var1=df['var1'].str.split(',')).explode('var1')
xd = xd.assign(var2=xd['var2'].str.split(',')).explode('var2')
xd
But the above code generates multiple irrelevant rows. I am stuck here. Please suggest an answer.
DataFrame.explode
For multiple columns, specify a non-empty list of column labels (str or tuple); the list-like values in all specified columns must have matching lengths on each row of the frame.
From docs:
df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
                   'B': 1,
                   'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
df
A B C
0 [0, 1, 2] 1 [a, b, c]
1 foo 1 NaN
2 [] 1 []
3 [3, 4] 1 [d, e]
Multi-column explode.
df.explode(list('AC'))
A B C
0 0 1 a
0 1 1 b
0 2 1 c
1 foo 1 NaN
2 NaN 1 NaN
3 3 1 d
3 4 1 e
For your specific question:
xd = df.assign(
    var1=df['var1'].str.split(','),
    var2=df['var2'].str.split(',')
).explode(['var1', 'var2'])
xd
var1 var2 var3
0 a e 1
0 b f 1
0 c g 1
0 d h 1
1 p s 2
1 q t 2
1 r u 2
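For reference, a minimal input frame consistent with that output; the question did not show its data, so the values here are an assumption. Note that passing a list of columns to explode requires pandas 1.3 or later.
df = pd.DataFrame({
    'var1': ['a,b,c,d', 'p,q,r'],   # comma-separated values
    'var2': ['e,f,g,h', 's,t,u'],   # same number of items per row as var1
    'var3': [1, 2],
})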

drop rows using pandas groupby and filter

I'm trying to drop rows from a df where certain conditions are met. Using below, I'm grouping values using column C. For each unique group, I want to drop ALL rows where A is less than 1 AND B is greater than 100. This has to occur on the same row though. If I use .any() or .all(), it doesn't return what I want.
df = pd.DataFrame({
    'A': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'B': [101, 2, 3, 1, 5, 101, 2, 3, 4, 5],
    'C': ['d', 'd', 'd', 'd', 'e', 'e', 'e', 'f', 'f', 'f'],
})
df.groupby(['C']).filter(lambda g: g['A'].lt(1) & g['B'].gt(100))
initial df:
A B C
0 1 101 d # A is not lt 1 so keep all d's
1 0 2 d
2 1 3 d
3 0 1 d
4 1 5 e
5 0 101 e # A is lt 1 and B is gt 100 so drop all e's
6 0 2 e
7 1 3 f
8 0 4 f
9 1 5 f
intended out:
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
For better performance, get all C values that match the condition and then filter the original column C with Series.isin in boolean indexing, using the inverted mask:
df1 = df[~df['C'].isin(df.loc[df['A'].lt(1) & df['B'].gt(100), 'C'])]
Another idea is use GroupBy.transform with GroupBy.any for test if match at least one value:
df1 = df[~(df['A'].lt(1) & df['B'].gt(100)).groupby(df['C']).transform('any')]
Your solution is possible if the lambda returns a scalar, using any() negated by not (or, equivalently, all() on the inverted conditions); for a large DataFrame it will be slow:
df1 = df.groupby(['C']).filter(lambda g: not (g['A'].lt(1) & g['B'].gt(100)).any())
df1 = df.groupby(['C']).filter(lambda g: (g['A'].ge(1) | g['B'].le(100)).all())
print(df1)
A B C
0 1 101 d
1 0 2 d
2 1 3 d
3 0 1 d
7 1 3 f
8 0 4 f
9 1 5 f
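To make the first approach easier to follow, here is the same one-liner unpacked into its intermediate steps (same df as above):
bad_groups = df.loc[df['A'].lt(1) & df['B'].gt(100), 'C']   # C values of the offending rows -> just 'e' here
mask = df['C'].isin(bad_groups)                             # True for every row whose group contains an offending row
df1 = df[~mask]                                             # keep all the other groups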

Copy data from a row to another row in Pandas dataframe based on condition

Let's say I have a (pandas) dataframe like this:
Index A ID B C
1 a 1 0 0
2 b 2 0 0
3 c 2 a a
4 d 3 0 0
I want to copy the data of the third row to the second row, because their IDs are matching, but the data is not filled. However, I want to leave column 'A' intact. Looking for a result like this:
Index A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0
What would you suggest as solution?
You can try replacing '0' with NaN then ffill()+bfill() using groupby()+apply():
df[['B', 'C']] = df[['B', 'C']].replace('0', float('NaN'))
df[['B', 'C']] = df.groupby('ID')[['B', 'C']].apply(lambda x: x.ffill().bfill()).fillna('0')
output of df:
Index A ID B C
0 1 a 1 0 0
1 2 b 2 a a
2 3 c 2 a a
3 4 d 3 0 0
Note: you can also use the transform() method in place of the apply() method.
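A sketch of that variant, under the same assumptions as the answer above (transform fills within each ID group while keeping the original row alignment, so the assignment back lines up):
df[['B', 'C']] = df[['B', 'C']].replace('0', float('NaN'))
df[['B', 'C']] = (
    df.groupby('ID')[['B', 'C']]
      .transform(lambda s: s.ffill().bfill())  # forward- then back-fill within each ID group
      .fillna('0')                             # put '0' back where a group had no value at all
)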
You can use combine_first:
s = df.loc[df[["B","C"]].ne("0").all(1)].set_index("ID")[["B", "C"]]
print (s.combine_first(df.set_index("ID")).reset_index())
ID A B C Index
0 1 a 0 0 1.0
1 2 b a a 2.0
2 2 c a a 3.0
3 3 d 0 0 4.0
import pandas as pd
data = { 'A': ['a', 'b', 'c', 'd'], 'ID': [1, 2, 2, 3], 'B': [0, 0, 'a', 0], 'C': [0, 0, 'a', 0]}
df = pd.DataFrame(data)
df.index += 1
index_to_be_replaced = 2
index_to_use_to_replace = 3
columns_to_replace = ['ID', 'B', 'C']
columns_not_to_replace = ['A']
x = df[columns_not_to_replace].loc[index_to_be_replaced]
y = df[columns_to_replace].loc[index_to_use_to_replace]
df.loc[index_to_be_replaced] = pd.concat([x, y])
print(df)
Does this solve your problem? I would also look at other pandas functions, like join and merge.
❯ python3 b.py
A ID B C
1 a 1 0 0
2 b 2 a a
3 c 2 a a
4 d 3 0 0

Pandas dataframe rearrangement stack to two value columns (for factorplots)

I have been trying to rearrange my dataframe to use it as input for a factorplot. The raw data would look like this:
A B C D
1 0 1 2 "T"
2 1 2 3 "F"
3 2 1 0 "F"
4 1 0 2 "T"
...
My question is how can I rearrange it into this form:
col val val2
1 A 0 "T"
1 B 1 "T"
1 C 2 "T"
2 A 1 "F"
...
I was trying:
df = DF.cumsum(axis=0).stack().reset_index(name="val")
However, this produces only one value column, not two. Thanks for your support.
I would use melt, and you can sort it however you like:
pd.melt(df.reset_index(), id_vars=['index', 'D'], value_vars=['A', 'B', 'C']).sort_values(by='index')
Out[40]:
index D variable value
0 1 T A 0
4 1 T B 1
8 1 T C 2
1 2 F A 1
5 2 F B 2
9 2 F C 3
2 3 F A 2
6 3 F B 1
10 3 F C 0
3 4 T A 1
7 4 T B 0
11 4 T C 2
then you can rename the columns to match the desired output; assigning the melted frame first:
melted = pd.melt(df.reset_index(), id_vars=['index', 'D'], value_vars=['A', 'B', 'C']).sort_values(by='index')
melted.set_index('index').rename(columns={'variable': 'col', 'value': 'val', 'D': 'val2'})
consider your dataframe df
df = pd.DataFrame([
    [0, 1, 2, 'T'],
    [1, 2, 3, 'F'],
    [2, 1, 0, 'F'],
    [1, 0, 2, 'T'],
], [1, 2, 3, 4], list('ABCD'))
solution
df.set_index('D', append=True) \
  .rename_axis(columns='col') \
  .rename_axis(index=[None, 'val2']) \
  .stack().to_frame('val') \
  .reset_index(['col', 'val2']) \
  [['col', 'val', 'val2']]
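With the question's data, that chain should produce:
  col  val val2
1   A    0    T
1   B    1    T
1   C    2    T
2   A    1    F
2   B    2    F
2   C    3    F
3   A    2    F
3   B    1    F
3   C    0    F
4   A    1    T
4   B    0    T
4   C    2    T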

Create and populate a dataframe using the unique values of another dataframe

I have a dataframe df like this:
X1 X2 X3
0 a c a
1 b e c
2 c nan e
3 d nan nan
I would like to create a new dataframe newdf which has one column (uentries) that contains the unique entries of df, plus the three columns of df, filled with 0 and 1 depending on whether the entry of uentries exists in the respective column of df.
My desired output would therefore look as follows (uentries does not need to be ordered):
uentries X1 X2 X3
0 a 1 0 1
1 b 1 0 0
2 c 1 1 1
3 d 1 0 0
4 e 0 1 1
Currently, I do it like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
uniqueEntries = set([x for x in np.ravel(df.values) if str(x) != 'nan'])
newdf = pd.DataFrame()
newdf['uentries'] = list(uniqueEntries)
for coli in df.columns:
    newdf[coli] = newdf['uentries'].isin(df[coli])
newdf.loc[:, 'X1':'X3'] = newdf.loc[:, 'X1':'X3'].astype(int)
which gives me the desired output.
Is it possible to fill newdf in a more efficient manner?
This is a simple way to approach this problem using pd.value_counts.
newdf = df.apply(pd.value_counts).fillna(0)
newdf['uentries'] = newdf.index
newdf = newdf[['uentries', 'X1','X2','X3']]
newdf
uentries X1 X2 X3
a a 1 0 1
b b 1 0 0
c c 1 1 1
d d 1 0 0
e e 0 1 1
nan nan 0 2 1
Then you can just drop the row with the nan values:
newdf.drop(['nan'])
uentries X1 X2 X3
a a 1 0 1
b b 1 0 0
c c 1 1 1
d d 1 0 0
e e 0 1 1
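Note that value_counts gives counts, not indicators; if an entry could repeat within a column, a small addition (not part of the original answer) clips the counts down to 0/1:
newdf = df.apply(pd.value_counts).fillna(0).gt(0).astype(int)   # 1 if the entry occurs at all in that column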
You can use get_dummies, sum and finally concat with fillna:
import pandas as pd
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', 'nan', 'nan'],
                   'X3': ['a', 'c', 'e', 'nan']})
print(df)
X1 X2 X3
0 a c a
1 b e c
2 c nan e
3 d nan nan
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0))
X1 X2 X3
a 1 0 1
b 1 0 0
c 1 1 1
d 1 0 0
e 0 1 1
nan 0 2 1
If you use np.nan in test data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'X1': ['a', 'b', 'c', 'd'],
                   'X2': ['c', 'e', np.nan, np.nan],
                   'X3': ['a', 'c', 'e', np.nan]})
print(df)
a = pd.get_dummies(df['X1']).sum()
b = pd.get_dummies(df['X2']).sum()
c = pd.get_dummies(df['X3']).sum()
print(pd.concat([a, b, c], axis=1, keys=['X1', 'X2', 'X3']).fillna(0))
X1 X2 X3
a 1 0 1
b 1 0 0
c 1 1 1
d 1 0 0
e 0 1 1
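One way to generalize the get_dummies idea to any number of columns, sketched here against the np.nan test data above (this is not from the original answers), is to stack the frame first so every non-null entry becomes one row of the dummy frame:
out = (pd.get_dummies(df.stack())    # stack drops NaN; one indicator column per unique entry
         .groupby(level=1).max()     # one row per original column: 1 if the entry occurs there
         .T                          # entries as rows, X1..X3 as columns
         .astype(int)
         .rename_axis('uentries')
         .reset_index())
This avoids writing one get_dummies call per column and scales to any number of columns.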
