I would like to join 2 dataframes, so that the result will be the intersection on the two datasets on the key column.
By doing this:
result = pd.merge(df1,df2,on='key', how='inner')
I will get what I need, but with extra columns of df2. I only want df1 columns in the results. (I do not want to delete them later).
Any ideas?
Thanks,
Here is a generic solution which will work for one and for multiple keys (joining) columns:
Setup:
In [28]: a = pd.DataFrame({'a':[1,2,3,4], 'b':[10,20,30,40], 'c':list('abcd')})
In [29]: b = pd.DataFrame({'a':[3,4,5,6], 'b':[30,41,51,61], 'c':list('efgh')})
In [30]: a
Out[30]:
a b c
0 1 10 a
1 2 20 b
2 3 30 c
3 4 40 d
In [31]: b
Out[31]:
a b c
0 3 30 e
1 4 41 f
2 5 51 g
3 6 61 h
multiple joining keys:
In [32]: join_cols = ['a','b']
In [33]: a.merge(b[join_cols], on=join_cols)
Out[33]:
a b c
0 3 30 c
single joining key:
In [34]: join_cols = ['a']
In [35]: a.merge(b[join_cols], on=join_cols)
Out[35]:
a b c
0 3 30 c
1 4 40 d
Related
I have two DataFrames
df_1:
idx A X
0 1 A
1 2 B
2 3 C
3 4 D
4 1 E
5 2 F
and
df_2:
idx B Y
0 1 H
1 2 I
2 4 J
3 2 K
4 3 L
5 1 M
my goal is get the following:
df_result:
idx A X B Y
0 1 A 1 H
1 2 B 2 I
2 4 D 4 J
3 2 F 2 K
I am trying to match both A and B columns, based on on the column Bfrom df_2.
Columns A and B repeat their content after getting to 4. The order matters here and because of that the row from df_1 with idx = 4 does not match the one from df_2 with idx = 5.
I was trying to use:
matching = list(set(df_1["A"]) & set(df_2["B"]))
and then
df1_filt = df_1[df_1['A'].isin(matching)]
df2_filt = df_2[df_2['B'].isin(matching)]
But this does not take the order into consideration.
I am looking for a solution without many for loops.
Edit:
df_result = pd.merge_asof(left=df_1, right=df_2, left_on='idx', right_on='idx', left_by='A', right_by='B', direction='backward', tolerance=2).dropna().drop(labels='idx', axis='columns').reset_index(drop=True)
Gets me what I want.
IIUC this should work:
df_result = df_1.merge(df_2,
left_on=['idx', 'A'], right_on=['idx', 'B'])
Having two dataframes df1 and df2 (same number of rows) how can we, very simply, take all the columns from df2 and add them to df1? Using join, we are joining them on the index or a given column, but assuming their index's are completely different and they have no columns in common. Is that doable (without the obvious way of looping over each column in df2and add them as new to df1)?
EDIT: added an example.
Note; no index, column names are mentioned since it should not matter (thats is the "problem").
df1= [[1,3,2,
[11,20,33]]
df2 = [["bird",np.nan,37,np.sqrt(2)]
["dog",0.123,3.14,0]]
pd.some_operation(df1,df2)
#[[1,3,2,"bird",np.nan,37,np.sqrt(2)]
#[11,20,33,"dog",0.123,3.14,0]]
Samples:
df1 = pd.DataFrame({
'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
}, index = list('QRSTUW'))
df2 = pd.DataFrame({
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')
}, index = list('KLMNOP'))
Pandas always use index values if use join or concat by axis=1, so for correct alignement is necessary create same index values:
df = df1.join(df2.set_index(df1.index))
df = pd.concat([df1, df2.set_index(df1.index)], axis=1)
print (df)
A B C D E F
Q a 4 7 1 5 a
R b 5 8 3 3 a
S c 4 9 5 6 a
T d 5 4 7 9 b
U e 5 2 1 2 b
W f 4 3 0 4 b
Or create default index in both DataFrames:
df = df1.reset_index(drop=True).join(df2.reset_index(drop=True))
df = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)], axis=1)
print (df)
A B C D E F
0 a 4 7 1 5 a
1 b 5 8 3 3 a
2 c 4 9 5 6 a
3 d 5 4 7 9 b
4 e 5 2 1 2 b
5 f 4 3 0 4 b
I have two Series (df1 and df2) of equal length, which need to be combined into one DataFrame column as follows. Each index has only one value or no values but never two values, so there are no duplicates (e.g. if df1 has a value 'A' at index 0, then df2 is empty at index 0, and vice versa).
df1 = c1 df2 = c2
0 A 0
1 B 1
2 2 C
3 D 3
4 E 4
5 5 F
6 6
7 G 7
The result I want is this:
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
I have tried .concat, .append and .union, but these do not produce the desired result. What is the correct approach then?
You can try so:
df1['new'] = df1['c1'] + df2['c2']
For an in-place solution, I recommend pd.Series.replace:
df1['c1'].replace('', df2['c2'], inplace=True)
print(df1)
c1
0 A
1 B
2 C
3 D
4 E
5 F
6
7 G
Background: I have a matrix which represents the distance between two points. In this matrix both rows and columns are the data points. For example:
A B C
A 0 999 3
B 999 0 999
C 3 999 0
In this toy example let's say I want to drop C for some reason, because it is far away from any other point. So I first aggregate the count:
df["far_count"] = df[df == 999].count()
and then batch remove them:
df = df[df["far_count"] == 2]
In this example this looks a bit redundant but please imagine that I have many data points like this (say in the order of 10Ks)
The problem with the above batch removal is that I would like to remove rows and columns in the same time (instead of just rows) and it is unclear to me how to do so elegantly. A naive way is to get a list of such data points and put it in a loop and then:
for item in list:
df.drop(item, axis=1).drop(item, axis=0)
But I was wondering if there is a better way. (Bonus if we could skip the intermdiate step far_count)
np.random.seed([3,14159])
idx = pd.Index(list('ABCDE'))
a = np.random.randint(3, size=(5, 5))
df = pd.DataFrame(
a.T.dot(a) * (1 - np.eye(5, dtype=int)),
idx, idx)
df
A B C D E
A 0 4 2 4 2
B 4 0 1 5 2
C 2 1 0 2 6
D 4 5 2 0 3
E 2 2 6 3 0
l = ['A', 'C']
m = df.index.isin(l)
df.loc[~m, ~m]
B D E
B 0 5 2
D 5 0 3
E 2 3 0
For your specific case, because the array is symmetric you only need to check one dimension.
m = (df.values == 999).sum(0) == len(df) - 1
In [66]: x = pd.DataFrame(np.triu(df), df.index, df.columns)
In [67]: x
Out[67]:
A B C
A 0 999 3
B 0 0 999
C 0 0 0
In [68]: mask = x.ne(999).all(1) | x.ne(999).all(0)
In [69]: df.loc[mask, mask]
Out[69]:
A C
A 0 3
C 3 0
I am trying to get an output where I wish to add column d in d1 and d2 where a b c are same (like groupby).
For example
d1 = pd.DataFrame([[1,2,3,4]],columns=['a','b','c','d'])
d2 = pd.DataFrame([[1,2,3,4],[2,3,4,5]],columns=['a','b','c','d'])
then I'd like to get an output as
a b c d
0 1 2 3 8
1 2 3 4 5
Merging the two data frames and adding the resultant column d where a b c are same.
d1.add(d2) or radd gives me an aggregate of all columns
The solution should be a DataFrame which can be added again to another similarly.
Any help is appreciated.
You can use set_index first:
print (d2.set_index(['a','b','c'])
.add(d1.set_index(['a','b','c']), fill_value=0)
.astype(int)
.reset_index())
a b c d
0 1 2 3 8
1 2 3 4 5
df = pd.concat([d1, d2])
df.drop_duplicates()
a b c d
0 1 2 3 4
1 2 3 4 5