a1 = pd.DataFrame({'A': [1,2,3], 'B': [2,3,4]})
b2 = pd.DataFrame({'A': [1,4], 'B': [3,6]})
and I want to get
c = pd.DataFrame({'A': [1,2,3,4], 'B': [3,3,4,6]})
by merging a1 and b2 on the key 'A',
but when 'A' is equal and 'B' differs, taking the value from b2.
How can I get this to work? I have no idea.
First concatenate both dataframes under each other to get one big dataframe:
c = pd.concat([a1, b2], axis=0)
A B
0 1 2
1 2 3
2 3 4
0 1 3
1 4 6
Then group on column A to get only the unique values of A; using last() ensures that when there is a duplicate, the value from b2 is used. This gives:
c = c.groupby('A').last()
B
A
1 3
2 3
3 4
4 6
Then reset the index to get a nice numerical index.
c = c.reset_index()
which returns:
A B
0 1 3
1 2 3
2 3 4
3 4 6
To do it all in one go, just run the following two lines:
c = pd.concat([a1, b2], axis=0)
c = c.groupby('A').last().reset_index()
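An equivalent sketch (assuming the same a1 and b2 as above) that skips the groupby entirely: keep the last occurrence of each key with drop_duplicates, so b2's row wins on conflicts.

```python
import pandas as pd

a1 = pd.DataFrame({'A': [1, 2, 3], 'B': [2, 3, 4]})
b2 = pd.DataFrame({'A': [1, 4], 'B': [3, 6]})

# Stack b2 below a1, then keep the *last* row for each value of 'A',
# so b2's row wins whenever both frames share a key.
c = (pd.concat([a1, b2])
       .drop_duplicates(subset='A', keep='last')
       .sort_values('A')
       .reset_index(drop=True))
print(c)
#    A  B
# 0  1  3
# 1  2  3
# 2  3  4
# 3  4  6
```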
Suppose I have the following dataframe
# dictionary with list object in values
details = {
'A1' : [1,3,4,5],
'A2' : [2,3,5,6],
'A3' : [4,3,2,6],
}
# creating a Dataframe object
df = pd.DataFrame(details)
I want to query each column with the following conditions to obtain a boolean mask, and then sum over axis=1:
A1 >= 3
A2 >= 3
A3 >= 4
I would like to end-up with the following dataframe
details = {
'A1' : [1,3,4,5],
'A2' : [2,3,5,6],
'A3' : [4,3,2,6],
'score' : [1,2,2,3]
}
# creating a Dataframe object
df = pd.DataFrame(details)
How would you do it?
Since your operators are all the same, you can try numpy broadcasting:
import numpy as np
df['score'] = (df.T >= np.array([3,3,4])[:, None]).sum()
print(df)
A1 A2 A3 score
0 1 2 4 1
1 3 3 3 2
2 4 5 2 2
3 5 6 6 3
You could also do:
df.assign(score=(df >= [3, 3, 4]).sum(axis=1))
A1 A2 A3 score
0 1 2 4 1
1 3 3 3 2
2 4 5 2 2
3 5 6 6 3
If you want to specifically align your comparators to each column, you can pass them as a dictionary that is alignable against the DataFrames columns.
>>> comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})
>>> df['score'] = df.ge(comparisons).sum(axis=1)
>>> df
A1 A2 A3 score
0 1 2 4 1
1 3 3 3 2
2 4 5 2 2
3 5 6 6 3
For a little more manual control, you can always subset your df according to your comparators before performing the comparisons.
comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})
df['score'] = df[comparisons.index].ge(comparisons).sum(axis=1)
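As a quick illustration of why the aligned Series is safer (with a hypothetical extra column A4 that is not part of the question): columns absent from comparisons are excluded by the subset, so they can never leak into the score.

```python
import pandas as pd

df = pd.DataFrame({
    'A1': [1, 3, 4, 5],
    'A2': [2, 3, 5, 6],
    'A3': [4, 3, 2, 6],
    'A4': [9, 9, 9, 9],   # hypothetical extra column, not part of the score
})

comparisons = pd.Series({'A1': 3, 'A2': 3, 'A3': 4})

# Subsetting by comparisons.index guarantees only the listed columns
# take part in the comparison, regardless of what else is in df.
df['score'] = df[comparisons.index].ge(comparisons).sum(axis=1)
print(df['score'].tolist())   # [1, 2, 2, 3]
```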
I have data like this
ID INFO
1 A=2;B=2;C=5
2 A=3;B=4;C=1
3 A=1;B=3;C=2
I want to split the Info columns into
ID A B C
1 2 2 5
2 3 4 1
3 1 3 2
I can split columns with one delimiter by using
df['A'], df['B'], df['C'] = df['INFO'].str.split(';').str
and then split again on '=', but this seems inefficient when there are many rows, and especially when there are so many fields that they cannot be hard-coded beforehand.
Any suggestion would be greatly welcome.
You could use named groups together with Series.str.extract, and in the end concat the 'ID' back. This assumes you always have A=, B= and C= in a line.
pd.concat([df['ID'],
           df['INFO'].str.extract(r'A=(?P<A>\d+);B=(?P<B>\d+);C=(?P<C>\d+)')], axis=1)
# ID A B C
#0 1 2 2 5
#1 2 3 4 1
#2 3 1 3 2
If you want a more flexible solution that can deal with cases where a single line might be 'A=1;C=2', then split on ';' and partition on '='. A pivot at the end gets you to the desired output.
### Starting Data
#ID INFO
#1 A=2;B=2;C=5
#2 A=3;B=4;C=1
#3 A=1;B=3;C=2
#4 A=1;C=2
(df.set_index('ID')['INFO']
.str.split(';', expand=True)
.stack()
.str.partition('=')
.reset_index(-1, drop=True)
.pivot(columns=0, values=2)
)
# A B C
#ID
#1 2 2 5
#2 3 4 1
#3 1 3 2
#4 1 NaN 2
Iterating over a Series is much faster than iterating across the rows of a dataframe.
So I would do:
pd.DataFrame([dict([x.split('=') for x in t.split(';')]) for t in df['INFO']], index=df['ID']).reset_index()
It gives as expected:
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
It should be faster than splitting the dataframe columns twice.
values = [dict(item.split("=") for item in value.split(";")) for value in df.INFO]
df[['a', 'b', 'c']] = pd.DataFrame(values)
This will give you the desired output:
ID INFO a b c
1 a=1;b=2;c=3 1 2 3
2 a=4;b=5;c=6 4 5 6
3 a=7;b=8;c=9 7 8 9
Explanation:
The first line converts every value to a dictionary.
e.g.
x = 'a=1;b=2;c=3'
dict(item.split("=") for item in x.split(";"))
results in :
{'a': '1', 'b': '2', 'c': '3'}
DataFrame can take a list of dicts as an input and turn it into a dataframe.
Then you only need to assign the dataframe to the columns you want:
df[['a', 'b', 'c']] = pd.DataFrame(values)
Another solution is Series.str.findall to extract the values, followed by apply(pd.Series):
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
df = df.drop(columns="INFO")
Details:
df = pd.DataFrame([[1, "A=2;B=2;C=5"],
[2, "A=3;B=4;C=1"],
[3, "A=1;B=3;C=2"]],
columns=["ID", "INFO"])
print(df.INFO.str.findall(r'=(\d+)'))
# 0 [2, 2, 5]
# 1 [3, 4, 1]
# 2 [1, 3, 2]
df[["A", "B", "C"]] = df.INFO.str.findall(r'=(\d+)').apply(pd.Series)
print(df)
# ID INFO A B C
# 0 1 A=2;B=2;C=5 2 2 5
# 1 2 A=3;B=4;C=1 3 4 1
# 2 3 A=1;B=3;C=2 1 3 2
# Remove INFO column
df = df.drop(columns="INFO")
print(df)
# ID A B C
# 0 1 2 2 5
# 1 2 3 4 1
# 2 3 1 3 2
Another solution :
#split on ';'
#explode
#then split on '='
#and pivot
df_INFO = (df.INFO
.str.split(';')
.explode()
.str.split('=',expand=True)
.pivot(columns=0,values=1)
)
pd.concat([df.ID,df_INFO],axis=1)
ID A B C
0 1 2 2 5
1 2 3 4 1
2 3 1 3 2
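Yet another flexible sketch (assuming, like the pivot answer above, that some rows may be missing fields): Series.str.extractall with named groups pulls out every key=value pair, and a pivot reshapes the keys into columns.

```python
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'INFO': ['A=2;B=2;C=5', 'A=3;B=4;C=1',
                            'A=1;B=3;C=2', 'A=1;C=2']})

# extractall yields one row per key=value match, indexed by (ID, match);
# pivot then turns the keys into columns. Missing fields become NaN.
out = (df.set_index('ID')['INFO']
         .str.extractall(r'(?P<key>\w+)=(?P<val>\d+)')
         .reset_index()
         .pivot(index='ID', columns='key', values='val'))
print(out)
```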
I have two Series (df1 and df2) of equal length, which need to be combined into one DataFrame column as follows. Each index has only one value or no values but never two values, so there are no duplicates (e.g. if df1 has a value 'A' at index 0, then df2 is empty at index 0, and vice versa).
df1:  c1        df2:  c2
0     A         0
1     B         1
2               2     C
3     D         3
4     E         4
5               5     F
6               6
7     G         7
The result I want is this:
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
I have tried .concat, .append and .union, but these do not produce the desired result. What is the correct approach then?
You can try this:
df1['new'] = df1['c1'] + df2['c2']
For an in-place solution, I recommend pd.Series.replace:
df1['c1'].replace('', df2['c2'], inplace=True)
print(df1)
c1
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
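A non-mutating alternative sketch (assuming, as above, that the gaps are empty strings rather than NaN): Series.where keeps c1 where it is non-empty and falls back to c2 elsewhere.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': ['A', 'B', '', 'D', 'E', '', '', 'G']})
df2 = pd.DataFrame({'c2': ['', '', 'C', '', '', 'F', '', '']})

# Keep c1 where it has a value; otherwise take the value from c2.
combined = df1['c1'].where(df1['c1'] != '', df2['c2'])
print(combined.tolist())   # ['A', 'B', 'C', 'D', 'E', 'F', '', 'G']
```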
I am trying to get an output where column d is summed across d1 and d2 wherever a, b and c are the same (like a groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
That is, merging the two data frames and summing the resulting column d where a, b and c are the same.
d1.add(d2) or radd gives me an aggregate of all columns
The solution should be a DataFrame that can be added to another one again in the same way.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
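Since the question asks for a result that can be added again in the same way, the pattern above can be wrapped in a small helper (the name add_on_keys is just for illustration):

```python
import pandas as pd

def add_on_keys(x, y, keys=('a', 'b', 'c')):
    """Sum column 'd' of two frames wherever the key columns match."""
    return (x.set_index(list(keys))
             .add(y.set_index(list(keys)), fill_value=0)
             .astype(int)
             .reset_index())

d1 = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'c', 'd'])
d2 = pd.DataFrame([[1, 2, 3, 4], [2, 3, 4, 5]], columns=['a', 'b', 'c', 'd'])

out = add_on_keys(d1, d2)
print(out)
#    a  b  c  d
# 0  1  2  3  8
# 1  2  3  4  5

# The result is a plain DataFrame, so it can be fed straight back in:
out2 = add_on_keys(out, d2)
```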
df = pd.concat([d1, d2])
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5
Df1
A B C
1 1 'a'
2 3 'b'
3 4 'c'
Df2
A B C
1 1 'k'
5 4 'e'
Expected output (Df2 minus Df1 on columns A and B, then merged back into Df1):
A B C
1 1 'a'
2 3 'b'
3 4 'c'
5 4 'e'
The difference should be based on two columns A and B and not all three columns. I do not care what column C contains in both Df2 and Df1.
try this:
In [44]: df1.set_index(['A','B']).combine_first(df2.set_index(['A','B'])).reset_index()
Out[44]:
A B C
0 1 1 'a'
1 2 3 'b'
2 3 4 'c'
3 5 4 'e'
It's an outer join, then filling in column C from df2 where the value is not known in df1:
dfx = df1.merge(df2, how='outer', on=['A', 'B'])
dfx['C'] = dfx.apply(
lambda r: r.C_x if not pd.isnull(r.C_x) else r.C_y, axis=1)
dfx[['A', 'B', 'C']]
=>
A B C
0 1 1 a
1 2 3 b
2 3 4 c
3 5 4 e
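The row-wise apply can be avoided; a vectorized sketch of the same idea (using the C_x/C_y column names that merge produces by default):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 3, 4], 'C': ['a', 'b', 'c']})
df2 = pd.DataFrame({'A': [1, 5], 'B': [1, 4], 'C': ['k', 'e']})

dfx = df1.merge(df2, how='outer', on=['A', 'B'])
# Prefer df1's C (C_x); fall back to df2's C (C_y) where df1 had no row.
dfx['C'] = dfx['C_x'].fillna(dfx['C_y'])
out = dfx[['A', 'B', 'C']]
print(out)
```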
Using concat and drop_duplicates:
output = pd.concat([df1, df2])
output = output.drop_duplicates(subset = ["A", "B"], keep = 'first')
Desired df:
A B C
0 1 1 a
1 2 3 b
2 3 4 c
1 5 4 e