Setting the index after merging with pandas?

Executing the following merge
import pandas as pd
s = pd.Series(range(5, 10), index=range(10, 15), name='score')
df = pd.DataFrame({'id': (11, 13), 'value': ('a', 'b')})
pd.merge(s, df, 'left', left_index=True, right_on='id')
results in this data frame:
     score  id value
NaN      5  10   NaN
0.0      6  11     a
NaN      7  12   NaN
1.0      8  13     b
NaN      9  14   NaN
Why does Pandas take the index from the right data frame as the index for the result, instead of the index from the left series, even though I specified both a left merge and left_index=True? The documentation says
left: use only keys from left frame
which I interpreted differently from the result I am actually getting. What I expected was the following data frame.
    score  id value
10      5  10   NaN
11      6  11     a
12      7  12   NaN
13      8  13     b
14      9  14   NaN
I am using Python 3.7.5 with Pandas 0.25.3.

Here's what happens:
the output index is the intersection of the index/column merge keys, which is [0, 1] here
missing keys are replaced with NaN
the NaNs force the index dtype to be upcast to float
To set the index, just assign to it:
s2 = pd.merge(s, df, how='left', left_index=True, right_on='id')
s2.index = s.index
    score  id value
10      5  10   NaN
11      6  11     a
12      7  12   NaN
13      8  13     b
14      9  14   NaN
You can also merge on s (just because I dislike calling pd.merge directly):
(s.to_frame()
.merge(df, how='left', left_index=True, right_on='id')
.set_axis(s.index, axis=0, inplace=False))
    score  id value
10      5  10   NaN
11      6  11     a
12      7  12   NaN
13      8  13     b
14      9  14   NaN
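A version note, hedged: in pandas 2.0 the inplace argument of set_axis was removed, so on modern pandas the same chain would end with a plain set_axis:
(s.to_frame()
 .merge(df, how='left', left_index=True, right_on='id')
 .set_axis(s.index, axis=0))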

You can also do this with reset_index and set_index:
df = (pd.merge(s, df, 'left', left_index=True, right_on='id')
        .reset_index(drop=True)
        .set_index('id')
        .rename_axis(index=None))
df.insert(1, 'id', df.index)
    score  id value
10      5  10   NaN
11      6  11     a
12      7  12   NaN
13      8  13     b
14      9  14   NaN

Since I do not need the duplicated information in both the id column and the index, I went with a combination of the answers from cs95 and oppressionslayer, and did the following:
pd.merge(s, df, 'left', left_index=True, right_on='id').set_index('id')
Which results in this data frame:
    score value
id
10      5   NaN
11      6     a
12      7   NaN
13      8     b
14      9   NaN
Since this is different from what I initially asked for, I am leaving the answer from cs95 as the accepted answer, but I think this use case needs to be documented as well.
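If you would rather keep that index unnamed as well, a small follow-up sketch (same idea as the rename_axis call in the earlier answer):
pd.merge(s, df, 'left', left_index=True, right_on='id').set_index('id').rename_axis(None)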

Related

Second lowest value over the past 756 days in pandas [duplicate]

I need to get the rolling 2nd largest value of a df.
To get the largest value I do
max = df.sort_index(ascending=True).rolling(10).max()
When I try this, python throws an error
max = df.sort_index(ascending=True).rolling(10).nlargest(2)
AttributeError: 'Rolling' object has no attribute 'nlargest'
Is this a bug? What else can I use that is performant?
I'd do something like this:
df.rolling(10).apply(lambda x: pd.Series(x).nlargest(2).iloc[-1])
Sort each window in descending order with np.sort and select the second value:
import numpy as np
import pandas as pd

np.random.seed(2019)
df = pd.DataFrame({
    'B': np.random.randint(20, size=15)
})
print (df)
     B
0    8
1   18
2    5
3   15
4   12
5   10
6   16
7   16
8    7
9    5
10  19
11  12
12  16
13  18
14   5
a = df.rolling(10).apply(lambda x: -np.sort(-x)[1])
#alternative
#a = df.rolling(10).apply(lambda x: np.sort(x)[-2])
print (a)
       B
0    NaN
1    NaN
2    NaN
3    NaN
4    NaN
5    NaN
6    NaN
7    NaN
8    NaN
9   16.0
10  18.0
11  16.0
12  16.0
13  18.0
14  18.0
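If performance matters (the question explicitly asks for something performant), a hedged alternative is to materialize all windows at once and sort them vectorized; this assumes numpy >= 1.20 for sliding_window_view:
import numpy as np
import pandas as pd

# Build an (n - 9) x 10 view of all rolling windows, sort each row,
# and take the second-largest element per window.
values = df['B'].to_numpy(dtype=float)
windows = np.lib.stride_tricks.sliding_window_view(values, 10)
second = np.sort(windows, axis=1)[:, -2]

# Re-attach the nine leading NaNs so the result lines up with df's index.
a = pd.Series(np.concatenate([np.full(9, np.nan), second]), index=df.index, name='B')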

Positional indexing with NA values

I need to index a dataframe using positional indices held in df1, but df1 contains NA values from a previous operation and I want to preserve them. How can I achieve this?
df1
NaN
1
NaN
NaN
NaN
6
df2
0 10
1 15
2 13
3 15
4 16
5 17
6 17
7 18
8 10
df3
0 15
1 17
The output I want
NaN
15
NaN
NaN
NaN
17
df2.iloc(df1)
IndexError: indices are out-of-bounds
The .iloc method in this case leads to an out-of-bounds error, so I think .iloc is not usable here. df3 is another output generated by .loc, but I don't know how to add the NaNs back between its values. Achieving the output using df1 and df3 is also fine.
If df1 and df2 have the same index values, you can replace the non-missing values with values from the other DataFrame using DataFrame.mask with DataFrame.isna:
df1 = df2.mask(df1.isna())
print (df1)
    col
0   NaN
1  15.0
2   NaN
3   NaN
4   NaN
5  17.0
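For the df1/df3 route the question allows, a hedged sketch (assuming df1 and df3 are single-column DataFrames and df3 holds exactly one value per non-NaN entry of df1, in order):
out = df1.copy()
mask = out.iloc[:, 0].notna()
# Write df3's values into the non-NaN positions, leaving the NaNs untouched.
out.loc[mask, out.columns[0]] = df3.iloc[:, 0].to_numpy()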

Merging/Concat/Joining two dataframes

I have a pandas dataframe with a distinct code identifier, as detailed below:
df1 = pd.DataFrame([['a', 1], ['b', 2], ['c', 3], ['d', 4], ['e', 5], ['f', 5]],
                   columns=['code', 'value1'])
and a second dataframe with the following:
df2 = pd.DataFrame([['a', 11], ['b', 12], ['c', 13], ['d', 14], ['e', 15], ['f', 16], ['g', 17], ['h', 2], ['i', 3], ['j', 4], ['k', 5], ['l', 5]],
                   columns=['code', 'value2'])
I would like to only see the codes identified in df1 (i.e. a-f) and have a third column entitled value2.
I have tried
df1 = df1.join(df2, on='Code')
but I keep getting NaN values.
I have looked in several places and seen merge, concat and join, but none of them appear to work.
Try this:
df1 = df1.merge(df2, on='code')
since you named the column 'code', not 'Code'.
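A hedged aside, since the question also tried join: DataFrame.join matches the on column of the caller against the index of the other frame, so the join variant would additionally need df2 indexed by code:
df1 = df1.join(df2.set_index('code'), on='code')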
To only see the codes identified in df1 (i.e. a-f) and have a third column entitled value2, you should use the merge method with how='inner' and on='code':
>>> df1.merge(df2, how='inner', on='code')
code value1 value2
0 a 1 11
1 b 2 12
2 c 3 13
3 d 4 14
4 e 5 15
5 f 5 16
Use:
>>> df1.merge(df2, how='inner', on='code')
code value1 value2
0 a 1 11
1 b 2 12
2 c 3 13
3 d 4 14
4 e 5 15
5 f 5 16
Or do you mean an outer merge, with how='outer'?
>>> df1.merge(df2, how='outer', on='code')
code value1 value2
0 a 1.0 11
1 b 2.0 12
2 c 3.0 13
3 d 4.0 14
4 e 5.0 15
5 f 5.0 16
6 g NaN 17
7 h NaN 2
8 i NaN 3
9 j NaN 4
10 k NaN 5
11 l NaN 5

Pandas new column replace only show specific pattern value in new column

Index value
1 880770000-t-ptt-018-108
2 Nan
3 760770000-t-ptm-001-107
4 Date
5 11/20/2020
6 607722991-t-ptr-001-888
7 NaN
8 Date
9 10/25/2020
10 12/30/2019
11 967722944-t-ptq-020-888
I want a new column in the same dataframe in which only values matching this specific pattern are kept, with all other values replaced by NaN, like this. The original table has 200k rows and 22 columns, and the pattern has over 5000 combinations.
Index value
1 880770000-t-ptt-018-108
2 Nan
3 760770000-t-ptm-001-107
4 NaN
5 Nan
6 607722991-t-ptr-001-888
7 NaN
8 NaN
9 NaN
10 NaN
11 967722944-t-ptq-020-888
df['value'] = df['value'].apply(lambda x: x if isinstance(x, str) and "-t-" in x else np.nan)
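For 200k rows, a vectorized variant may be faster; a sketch using str.contains (na=False treats real NaNs as non-matches) with where to blank out everything else:
df['value'] = df['value'].where(df['value'].str.contains('-t-', na=False))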

Best way to join / merge by range in pandas

I frequently use pandas to merge (join) on a range condition.
For instance if there are 2 dataframes:
A (A_id, A_value)
B (B_id, B_low, B_high, B_name)
which are big and approximately of the same size (let's say 2M records each).
I would like to make an inner join between A and B, so A_value would be between B_low and B_high.
Using SQL syntax that would be:
SELECT *
FROM A,B
WHERE A_value between B_low and B_high
and that would be really easy, short and efficient.
Meanwhile, the only way I found in pandas (without using loops) is to create a dummy column in both tables, join on it (equivalent to a cross join) and then filter out unneeded rows. That sounds heavy and complex:
A['dummy'] = 1
B['dummy'] = 1
Temp = pd.merge(A,B,on='dummy')
Result = Temp[Temp.A_value.between(Temp.B_low,Temp.B_high)]
Another solution I had was to apply to each A value a search function on B, using a B[(x >= B.B_low) & (x <= B.B_high)] mask, but that sounds inefficient as well and might require index optimization.
Is there a more elegant and/or efficient way to perform this action?
Setup
Consider the dataframes A and B
A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high. (Note that the broadcast materializes a len(A) x len(B) boolean matrix, so at the 2M-row scale in the question you would need to work in chunks; see the sketch after the left-join variant below.)
import numpy as np

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1).append(
    A[~np.in1d(np.arange(len(A)), np.unique(i))],
    ignore_index=True, sort=False
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
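As flagged above, the full broadcast will not fit in memory for two 2M-row frames; a hedged sketch that processes a in blocks (the block size is an assumption to tune):
import numpy as np

block_size = 100_000  # tune to available memory
parts_i, parts_j = [], []
for start in range(0, len(a), block_size):
    block = a[start:start + block_size]
    bi, bj = np.where((block[:, None] >= bl) & (block[:, None] <= bh))
    parts_i.append(bi + start)  # shift row positions back into the full array
    parts_j.append(bj)
i = np.concatenate(parts_i)
j = np.concatenate(parts_j)
# i and j can then be fed to the same pd.concat as above.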
Not sure whether it is more efficient, but you can use SQL directly with pandas (via the sqlite3 module, for instance; inspired by this question), like:
import sqlite3
import numpy as np
import pandas as pd

conn = sqlite3.connect(":memory:")
df2 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1 = pd.DataFrame(np.random.randn(10, 5), columns=["col1", "col2", "col3", "col4", "col5"])
df1.to_sql("df1", conn, index=False)
df2.to_sql("df2", conn, index=False)
qry = "SELECT * FROM df1, df2 WHERE df1.col1 > 0 AND df1.col1 < 0.5"
tt = pd.read_sql_query(qry, conn)
You can adapt the query as needed in your application
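Applied to the question's actual range condition, a sketch reusing conn from above and the A and B frames from the earlier setup:
A.to_sql("A", conn, index=False)
B.to_sql("B", conn, index=False)
result = pd.read_sql_query(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    conn,
)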
I don't know how efficient it is, but someone wrote a wrapper that allows you to use SQL syntax with pandas objects. That's called pandasql. The documentation explicitly states that joins are supported. This might be at least easier to read since SQL syntax is very readable.
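A minimal sketch with pandasql, assuming it is installed; its sqldf helper runs a query against the DataFrames found in the namespace you pass it:
from pandasql import sqldf

result = sqldf(
    "SELECT * FROM A JOIN B ON A.A_value BETWEEN B.B_low AND B.B_high",
    locals(),
)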
conditional_join from pyjanitor may be helpful for the abstraction/convenience:
# pip install pyjanitor
import pandas as pd
import janitor
inner join
A.conditional_join(B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<=')
)
A_id A_value B_id B_low B_high
0 0 5 0 0 10
1 3 35 1 30 40
2 3 35 2 30 50
3 4 45 2 30 50
left join
A.conditional_join(
B,
('A_value', 'B_low', '>='),
('A_value', 'B_high', '<='),
how = 'left'
)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 1 15 NaN NaN NaN
2 2 25 NaN NaN NaN
3 3 35 1.0 30.0 40.0
4 3 35 2.0 30.0 50.0
5 4 45 2.0 30.0 50.0
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
Let's take a simple example:
df = pd.DataFrame([2, 3, 4, 5, 6], columns=['A'])
which returns
A
0 2
1 3
2 4
3 5
4 6
Now let's define a second dataframe:
df2 = pd.DataFrame([1, 6, 2, 3, 5], columns=['B_low'])
df2['B_high'] = [2, 8, 4, 6, 6]
which results in
B_low B_high
0 1 2
1 6 8
2 2 4
3 3 6
4 5 6
Here we go; we want the output to be index 3 with A value 5:
df.where(df['A'] >= df2['B_low']).where(df['A'] < df2['B_high']).dropna()
which results in
     A
3  5.0
Note that where compares the two frames row by row (they align by index), so this is element-wise filtering rather than a general range join like the solutions above.
I know this is an old question, but for newcomers there is now the pandas.merge_asof function that performs a join based on the closest match.
In case you want to do a merge so that a column of one DataFrame (df_right) is between 2 columns of another DataFrame (df_left) you can do the following:
df_left = pd.DataFrame({
    "time_from": [1, 4, 10, 21],
    "time_to": [3, 7, 15, 27]
})
df_right = pd.DataFrame({
    "time": [2, 6, 16, 25]
})
df_left
time_from time_to
0 1 3
1 4 7
2 10 15
3 21 27
df_right
time
0 2
1 6
2 16
3 25
First, find matches of the right DataFrame that are closest but largest than the left boundary (time_from) of the left DataFrame:
merged = pd.merge_asof(
    left=df_left,
    right=df_right.rename(columns={"time": "candidate_match_1"}),
    left_on="time_from",
    right_on="candidate_match_1",
    direction="forward"
)
merged
time_from time_to candidate_match_1
0 1 3 2
1 4 7 6
2 10 15 16
3 21 27 25
As you can see the candidate match in index 2 is wrongly matched, as 16 is not between 10 and 15.
Then, find matches of the right DataFrame that are closest but smaller than the right boundary (time_to) of the left DataFrame:
merged = pd.merge_asof(
    left=merged,
    right=df_right.rename(columns={"time": "candidate_match_2"}),
    left_on="time_to",
    right_on="candidate_match_2",
    direction="backward"
)
merged
time_from time_to candidate_match_1 candidate_match_2
0 1 3 2 2
1 4 7 6 6
2 10 15 16 6
3 21 27 25 25
Finally, keep the matches where the candidate matches are the same, meaning that the value of the right DataFrame lies between the values of the 2 columns of the left DataFrame:
merged["match"] = None
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "match"] = \
merged.loc[merged["candidate_match_1"] == merged["candidate_match_2"], "candidate_match_1"]
merged
time_from time_to candidate_match_1 candidate_match_2 match
0 1 3 2 2 2
1 4 7 6 6 6
2 10 15 16 6 None
3 21 27 25 25 25
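To finish with the inner-join equivalent, a small follow-up keeps only the rows where both candidates agreed:
final = merged.dropna(subset=["match"])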
