Pandas: idempotent/force join between dataframes with column overlap - python

I am working in a notebook, so if I run:
df1 = df1.join(series2)
It works fine. However, if I run it again, I receive the following error:
ValueError: columns overlap but no suffix specified
This happens because running the cell twice is equivalent to df1 = df1.join(series2).join(series2). Is there any way I can force an overwrite of the overlapping columns without creating an endless number of columns with the _y suffix?
Sample df1
index a
0 0
0 1
1 2
1 3
2 4
2 5
Sample series2
index b
0 1
1 2
2 3
Desired output from df1 = df1.join(series2)
index a b
0 0 1
0 1 1
1 2 2
1 3 2
2 4 3
2 5 3
Desired output from df1 = df1.join(series2); df1 = df1.join(series2)
# same as above because of forced overwrite on either the left or right join.
index a b
0 0 1
0 1 1
1 2 2
1 3 2
2 4 3
2 5 3
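One way to make the cell safe to re-run (just a sketch, not taken from an accepted answer) is to drop any overlapping columns before joining, so a second run simply replaces column b instead of raising the error:
import pandas as pd

df1 = pd.DataFrame({'a': [0, 1, 2, 3, 4, 5]}, index=[0, 0, 1, 1, 2, 2])
series2 = pd.Series([1, 2, 3], index=[0, 1, 2], name='b')

# Drop columns that already exist in df1 before joining, so running the
# cell again overwrites 'b' instead of raising the overlap ValueError.
overlap = df1.columns.intersection([series2.name])
df1 = df1.drop(columns=overlap).join(series2)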

Related

Search and update values in other dataframe for specific columns

I have two different dataframes in pandas.
First
A B C D VALUE
1 2 3 5 0
1 5 3 2 0
2 5 3 2 0
Second
A B C D Value
5 3 3 2 1
1 5 4 3 1
I want the values in columns A and B of the first dataframe to be looked up in the second dataframe. If the A and B values match, then update the VALUE column. So search only 2 columns in the other dataframe and update only 1 column; it is essentially the update-from-join process we know from SQL.
Result
A B C D VALUE
1 2 3 5 0
1 5 3 2 1
2 5 3 2 0
Only the VALUE of the second row changes (from 0 to 1), because its A and B values (1, 5) also appear in the second dataframe. Despite my attempts, I could not get this to work: my approach also changed columns A and B, but I only want the Value column of the matching rows to change.
You can use a merge:
cols = ['A', 'B']
df1['VALUE'] = (df2.merge(df1[cols], on=cols, how='right')['Value']
                  .fillna(df1['VALUE'], downcast='infer'))
output:
A B C D VALUE
0 1 2 3 5 0
1 1 5 3 2 1
2 2 5 3 2 0
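For reference, a minimal self-contained sketch (the frame names df1/df2 are assumed to match the answer above) that builds the sample data and applies the merge:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 1, 2], 'B': [2, 5, 5], 'C': [3, 3, 3],
                    'D': [5, 2, 2], 'VALUE': [0, 0, 0]})
df2 = pd.DataFrame({'A': [5, 1], 'B': [3, 5], 'C': [3, 4],
                    'D': [2, 3], 'Value': [1, 1]})

cols = ['A', 'B']
# Right-merge keeps df1's row order; unmatched rows get NaN in 'Value',
# which is then filled from the original VALUE column.
df1['VALUE'] = (df2.merge(df1[cols], on=cols, how='right')['Value']
                  .fillna(df1['VALUE'], downcast='infer'))
print(df1)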

Deleting rows in pandas until the specific value first occurred

For each unique user in the DataFrame, I would like to delete the row where val equals 1 for the first time, together with all of that user's previous rows.
For instance, given the following DataFrame, I would like to get another DataFrame in which, for each user, the row where the "val" column first equals 1 and all of its preceding rows are removed.
user val
0 1 0
1 1 1
2 1 0
3 1 1
4 2 0
5 2 0
6 2 1
7 2 0
8 3 1
9 3 0
10 3 0
11 3 0
12 3 1
Expected output:
user val
0 1 0
1 1 1
2 2 0
3 3 0
4 3 0
5 3 0
6 3 1
Sample Data
import pandas as pd
s = [1,1,1,1,2,2,2,2,3,3,3,3,3]
t = [0,1,0,1,0,0,1,0,1,0,0,0,1]
df = pd.DataFrame(zip(s,t), columns=['user', 'val'])
Use groupby with a cummax check and a shift to remove all rows before, and including, the first 1 in the 'val' column per user.
Assuming your values are either 1 or 0, it is also possible to create the mask with a double cumsum.
m = df.groupby('user').val.apply(lambda x: x.eq(1).cummax().shift().fillna(False))
# m = df.groupby('user').val.apply(lambda x: x.cumsum().cumsum().gt(1))
df.loc[m]
Output:
user val
2 1 0
3 1 1
7 2 0
9 3 0
10 3 0
11 3 0
12 3 1
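The question's expected output also renumbers the rows; if that matters, reset the index after filtering (a small follow-up to the mask m above):
# drop=True discards the old row labels instead of keeping them as a column
df.loc[m].reset_index(drop=True)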

Expand a list returned by a function to multiple columns (Pandas)

I have a function that I'm trying to call on each row of a dataframe, and I would like it to return 20 different numeric values, each of which should end up in a separate column of the original dataframe.
For example, this is not the actual function, but if this works the actual one will too:
def doStuff(x):
    return [x] * 5
So this just returns the same number five times. If I have the dataframe
import numpy as np
import pandas as pd
df = pd.DataFrame({'A' : [1,2,3]})
A
0 1
1 2
2 3
After calling
df = np.vectorize(doStuff)(df['A'])
It should end up looking like
A 1 2 3 4 5
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
I believe you need df.apply, twice.
In [1254]: df['A'].apply(np.vectorize(doStuff)).apply(pd.Series)
Out[1254]:
0 1 2 3 4
0 1 1 1 1 1
1 2 2 2 2 2
2 3 3 3 3 3
You may concatenate this with the original using pd.concat(..., axis=1):
In [1258]: pd.concat([df, df['A'].apply(np.vectorize(doStuff)).apply(pd.Series)], axis=1)
Out[1258]:
A 0 1 2 3 4
0 1 1 1 1 1 1
1 2 2 2 2 2 2
2 3 3 3 3 3 3
From pandas 0.23 you can use the result_type argument:
df = pd.DataFrame({'A' : [1,2]})
def doStuff(x):
    return [x] * 5
df.apply(doStuff, axis=1, result_type='expand')
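If the expanded values should also sit next to the original column, a sketch that combines result_type='expand' with a join (the out_ column prefix is only an assumption, not from the question):
# expand the list returned for each row into numbered columns, prefix them,
# and join them back onto the original frame by index
expanded = (df.apply(lambda row: doStuff(row['A']), axis=1, result_type='expand')
              .add_prefix('out_'))
df.join(expanded)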

return rows with unique pairs across columns

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows whose value pair does not also appear with the columns flipped. For instance, 1 and 3 is a combination I only want returned once, so if the same pair exists with the columns flipped (3 and 1), it can be removed. The table I'm looking for is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply sorted + drop_duplicates:
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1),
                  index=df.index, columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
A solution without sorting, using DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")

Rearranging a non-consecutive order of columns in pandas dataframe

I have a pandas dataframe (result1) with n columns (the number varies) that I generated by merging two other dataframes:
result1 = df1.merge(df2, on='ID', how='left')
The result1 dataframe is expected to have a variable number of columns (this is part of a larger script). I want to arrange the columns so that the last two columns become the second and third, followed by all the remaining columns, while the first column stays first. If result1 is known to have 6 columns, I could use:
result2=result1.iloc[:,[0,4,5,1,2,3]] #this works fine.
BUT, I need the 1, 2, 3 to be expressed as a range, as it is not practical to type out all the numbers for each df. So I thought of using:
result2=result1.iloc[:,[0,len(result1.columns), len(result1.columns)-1, 1:len(result1.columns-2]]
#Assuming 6 columns : 0, 5 , 4 , 1, 2, 3
That would be the ideal way, but this creates syntax errors. Any suggestions to fix this?
Instead of using slicing syntax, I'd just build a list and use that:
>>> df
0 1 2 3 4 5
0 0 1 2 3 4 5
1 0 1 2 3 4 5
2 0 1 2 3 4 5
3 0 1 2 3 4 5
4 0 1 2 3 4 5
>>> ncol = len(df.columns)
>>> df.iloc[:,[0, ncol-1, ncol-2] + list(range(1,ncol-2))]
0 5 4 1 2 3
0 0 5 4 1 2 3
1 0 5 4 1 2 3
2 0 5 4 1 2 3
3 0 5 4 1 2 3
4 0 5 4 1 2 3
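The same reordering can also be written with column labels instead of positions, which some find easier to read (a sketch under the same assumptions as the answer above):
cols = list(result1.columns)
# first column, then the last two in reverse order, then everything in between
result2 = result1[[cols[0], cols[-1], cols[-2]] + cols[1:-2]]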
