Merging dataframes with different dimensions and related data - python

I have two dataframes of different sizes that hold related data, and I want to merge them efficiently:
master_df = pd.DataFrame({'kpi_1': [1,2,3,4]},
index=['dn1_app1_bar.com',
'dn1_app2_bar.com',
'dn2_app1_foo.com',
'dn2_app2_foo.com'])
guard_df = pd.DataFrame({'kpi_2': [1,2],
'kpi_3': [10,20]},
index=['dn1_bar.com', 'dn2_foo.com'])
master_df:
kpi_1
dn1_app1_bar.com 1
dn1_app2_bar.com 2
dn2_app1_foo.com 3
dn2_app2_foo.com 4
guard_df:
kpi_2 kpi_3
dn1_bar.com 1 10
dn2_foo.com 2 20
I want to get a dataframe in which the values from each guard_df row indexed with <group>_<name> are "propagated" to all master_df rows whose index matches <group>_.*_<name>.
Expected result:
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1.0 10.0
dn1_app2_bar.com 2 1.0 10.0
dn2_app1_foo.com 3 2.0 20.0
dn2_app2_foo.com 4 2.0 20.0
What I've managed so far is the following basic approach:
def eval_base_dn(dn):
    chunks = dn.split('_')
    return '_'.join((chunks[0], chunks[2]))

for dn in master_df.index:
    for col in guard_df.columns:
        master_df.loc[dn, col] = guard_df.loc[eval_base_dn(dn), col]
but I'm looking for some more performant way to "broadcast" the values and merge the dataframes.

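With pandas 0.25+ you can pass an array (here, the transformed index) to the left_on parameter of merge and use a left join. In pandas 2.0+ str.replace no longer treats the pattern as a regular expression by default, so pass regex=True explicitly:
master_df = master_df.merge(guard_df,
left_on=master_df.index.str.replace('_.+_', '_', regex=True),
right_index=True,
how='left')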
print (master_df)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20
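As a minimal sketch of the same idea (not the answerer's exact code), the key derivation can be separated from the lookup. It assumes that every derived key exists in guard_df and that guard_df's index is unique; the names keys, looked_up and result are illustrative only.

```python
import pandas as pd

master_df = pd.DataFrame({'kpi_1': [1, 2, 3, 4]},
                         index=['dn1_app1_bar.com',
                                'dn1_app2_bar.com',
                                'dn2_app1_foo.com',
                                'dn2_app2_foo.com'])
guard_df = pd.DataFrame({'kpi_2': [1, 2],
                         'kpi_3': [10, 20]},
                        index=['dn1_bar.com', 'dn2_foo.com'])

# Derive the guard_df key for each master_df row by dropping the middle "_<app>_" chunk.
keys = master_df.index.str.replace(r'_.+_', '_', regex=True)

# Look up one guard_df row per key (duplicate keys are allowed in the target),
# then relabel the looked-up rows with master_df's own index.
looked_up = guard_df.reindex(keys)
looked_up.index = master_df.index

# Index-aligned join; master_df itself is left unchanged.
result = master_df.join(looked_up)
print(result)
```

Keeping the key in its own variable makes it easy to inspect which master_df rows map to which guard_df row before joining.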

Try this one:
>>> guard_key = master_df.index.str.split("_").map(lambda x: "{0}_{1}".format(x[0], x[-1]))
>>> pd.merge(master_df.assign(guard_df_id=guard_key), guard_df, left_on="guard_df_id", right_index=True).drop(["guard_df_id"], axis=1)
kpi_1 kpi_2 kpi_3
dn1_app1_bar.com 1 1 10
dn1_app2_bar.com 2 1 10
dn2_app1_foo.com 3 2 20
dn2_app2_foo.com 4 2 20

Related

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R. For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
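A short sketch of two variations on ngroup, under the assumption that you want either the 1-based numbering from the question or ids in order of first appearance. The column names idx and idx_first_seen are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})

# ngroup() numbers groups from 0 in sorted key order; add 1 for 1-based ids.
df['idx'] = df.groupby(['a', 'b']).ngroup() + 1

# With sort=False, groups are numbered in the order they first appear.
df['idx_first_seen'] = df.groupby(['a', 'b'], sort=False).ngroup()
print(df)
```

For this data both numberings follow the same order, because the rows already appear in sorted key order.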
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
Update: Newer versions of pandas (0.20.2+) offer a simpler way to do this with the ngroup method, as noted in a comment on the question above by @Constantino and in a subsequent answer by @CalumYou. I'll leave this here as an alternative approach, but ngroup seems like the better way to do this in most cases.
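The extra reset_index(drop=True) mentioned above could look roughly like this. It is a sketch, and the names ids and out plus the rename to idx are illustrative additions.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 1, 2, 2, 2], 'b': [1, 1, 2, 1, 1, 2]})
group_vars = ['a', 'b']

# One row per distinct (a, b) pair; drop the leftover original index so the
# pairs are renumbered 0..n-1, then expose that numbering as a column.
ids = (df.drop_duplicates(group_vars)
         .reset_index(drop=True)
         .reset_index()
         .rename(columns={'index': 'idx'}))

# Attach the consecutive id to every row of the original frame.
out = df.merge(ids, on=group_vars)
print(out)
```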
A simple way to do that is to concatenate your grouping columns (so that each combination of their values becomes one distinct string), then convert the result to a pandas Categorical and keep only its codes:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes, as the former is deprecated
Edit2: Added a separator as suggested by Authman Apatira
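To see why the separator matters, here is a small sketch with made-up values (1 and 11, which are not from the question) where plain concatenation makes two different groups collide:

```python
import pandas as pd

# Hypothetical data: (1, 11) and (11, 1) are different groups.
df = pd.DataFrame({'a': [1, 11], 'b': [11, 1]})

# Without a separator both rows become the string "111" and share one code.
no_sep = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes

# With a separator the strings are "1_11" and "11_1", so the groups stay distinct.
with_sep = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes

print(no_sep, with_sep)
```

The separator only helps if it cannot itself appear in the values being joined.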
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
#create a dummy grouper id by just joining desired rows
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x),axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use dict in .replace method
dict_idx = {y:x for x,y in dict_idx.items()}
#replace values with the generated dict
df["idx"] = df["idx"].replace(dict_idx)
print(df)
That would produce the desired output (the exact ids may differ, since set ordering is arbitrary):
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
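Note that the function above sorts the caller's dataframe in place. A minimal sketch of a non-mutating variant follows; the function name group_index_via_duplicated and the sample data are hypothetical, and no timing claim is made for it.

```python
import pandas as pd

def group_index_via_duplicated(df, grouping_cols):
    """Return a 1-based group id per row without modifying df."""
    # Sort a copy so that rows of the same group are adjacent.
    ordered = df.sort_values(grouping_cols)
    # The first row of each group is the one that is not a duplicate.
    new_group = ~ordered.duplicated(subset=grouping_cols, keep='first')
    # Running count of group starts, put back into the original row order.
    return new_group.cumsum().reindex(df.index)

# Hypothetical unsorted input to show that the original order is preserved.
df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': [1, 1, 1, 2]})
df['idx'] = group_index_via_duplicated(df, ['a', 'b'])
print(df)
```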
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts the grouping columns, then checks whether each row differs from the previous row and, if so, increments a counter by 1. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(axis=1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
Breaking this up into steps, let's look at the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row differs from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs to differ in a single column, and that is what .ne(0).any(axis=1) checks: not equal to 0 in any of the columns. After that, a cumulative sum keeps track of the groups.
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach by checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6

Update dataframe column based on another dataframe column without for loop

I have two dataframes df1 and df2.
df1:
id val
1 25
2 40
3 78
df2:
id val
2 8
1 5
Now I want to do something like df1['val'] = df1['val']/df2['val'] for matching ids. df2 only covers a subset of df1's ids, so rows of df1 without a match should stay unchanged. Right now I do this by iterating over all rows of df2:
for row in df2.iterrows():
    df1.loc[df1['id']==row[1]['id'], 'val'] /= row[1]['val']
df1:
id val
1 5
2 5
3 78
How can I achieve the same without using for loop to improve speed?
Use Series.map with Series.div:
df1['val'] = df1['val'].div(df1['id'].map(df2.set_index('id')['val']), fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
Solution with merge with left join:
df1['val'] = df1['val'].div(df1.merge(df2, on='id', how='left')['val_y'], fill_value=1)
print (df1)
id val
0 1 5.0
1 2 5.0
2 3 78.0
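A step-by-step sketch of the Series.map approach, showing where the missing divisor for id 3 comes from and how it is neutralised. The intermediate names lookup and divisor are illustrative, and fillna(1) is used here as a more explicit stand-in for fill_value=1.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'val': [25, 40, 78]})
df2 = pd.DataFrame({'id': [2, 1], 'val': [8, 5]})

# Build an id -> divisor lookup from df2.
lookup = df2.set_index('id')['val']

# Map each df1 id to its divisor; ids missing from df2 (here id 3) become NaN.
divisor = df1['id'].map(lookup)

# Treat a missing divisor as 1 so unmatched rows keep their value.
df1['val'] = df1['val'] / divisor.fillna(1)
print(df1)
```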

Pandas create new dataframe by querying other dataframes without using iterrows

I have two huge dataframes that both have the same id field. I want to make a simple summary dataframe that shows the maximum of specific columns. I understand iterrows() is frowned upon, so are there a couple of one-liners to do this? I don't understand lambda/apply very well, but maybe they would work here.
Stand-alone example
import pandas as pd
myid = [1,1,2,3,4,4,5]
name =['A','A','B','C','D','D','E']
x = [15,12,3,3,1,4,8]
df1 = pd.DataFrame(list(zip(myid, name, x)),
columns=['myid', 'name', 'x'])
display(df1)
myid = [1,2,2,2,3,4,5,5]
name =['A','B','B','B','C','D','E','E']
y = [9,6,3,4,6,2,8,2]
df2 = pd.DataFrame(list(zip(myid, name, y)),
columns=['myid', 'name', 'y'])
display(df2)
mylist = df1['myid'].unique()
df_summary = pd.DataFrame(mylist, columns=['MY_ID'])
## do work here...
Desired output
Use merge() followed by named aggregations:
df1.merge(df2, on=["myid","name"], how="outer")\
.groupby(["myid","name"], as_index=False).agg(MAX_X=("x","max"),MAX_Y=("y","max"))
myid name MAX_X MAX_Y
0 1 A 15 9
1 2 B 3 6
2 3 C 3 6
3 4 D 4 2
4 5 E 8 8
Updated: you have noted that your dataframes are large and that the solution above gives you an out-of-memory (OOM) error. Logically, aggregating first and then merging will use less memory:
pd.merge(
df1.groupby(["myid","name"],as_index=False).agg(MAX_X=("x","max")),
df2.groupby(["myid","name"],as_index=False).agg(MAX_Y=("y","max")),
on=["myid","name"]
)
You can try concat + groupby.max:
out = (pd.concat((df1,df2),sort=False).groupby(['myid','name']).max()
.add_prefix("Max_").reset_index())
myid name Max_x Max_y
0 1 A 15.0 9.0
1 2 B 3.0 6.0
2 3 C 3.0 6.0
3 4 D 4.0 2.0
4 5 E 8.0 8.0
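A runnable sketch of the aggregate-first approach on the question's sample data. The how="outer" argument is an addition not in the original answer, made on the assumption that an id present in only one dataframe should still appear in the summary; the names keys, max_x, max_y and df_summary are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'myid': [1, 1, 2, 3, 4, 4, 5],
                    'name': ['A', 'A', 'B', 'C', 'D', 'D', 'E'],
                    'x': [15, 12, 3, 3, 1, 4, 8]})
df2 = pd.DataFrame({'myid': [1, 2, 2, 2, 3, 4, 5, 5],
                    'name': ['A', 'B', 'B', 'B', 'C', 'D', 'E', 'E'],
                    'y': [9, 6, 3, 4, 6, 2, 8, 2]})

keys = ['myid', 'name']

# Reduce each frame to one row per key before merging, so the merge
# never has to materialise the row-by-row cross product of duplicates.
max_x = df1.groupby(keys, as_index=False).agg(MAX_X=('x', 'max'))
max_y = df2.groupby(keys, as_index=False).agg(MAX_Y=('y', 'max'))

# Outer join keeps keys that exist in only one of the two frames.
df_summary = max_x.merge(max_y, on=keys, how='outer')
print(df_summary)
```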

How to join DataFrames and get max of particular column?

I have two PySpark DataFrames, df1 and df2. They have the same column names but might have a different number of rows. Also, some combinations of wpk and ipk may not exist in one of the DataFrames.
df1 =
wpk ipk num
1 2 23.4
1 3 45.5
2 1 0.0
df2 =
wpk ipk num
1 1 12.0
1 3 40.0
2 1 50.0
I want to obtain a new DataFrame df that is the result of the outer joining of df1 and df2. The df should have the same columns, but the column num should be the max of df1 and df2.
The expected result is this one:
wpk ipk num
1 1 12.0
1 2 23.4
1 3 45.5
2 1 50.0
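I'm unsure whether this is suitable for your problem, since it uses pandas rather than PySpark, but this is how I would achieve the specified result. (DataFrame.append was removed in pandas 2.0, so pd.concat is used here.)
import pandas as pd
df3 = pd.concat([df1, df2]).groupby(['wpk','ipk'])['num'].max()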
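A self-contained pandas sketch on the question's sample data, assuming the data fits in memory as pandas DataFrames (the question itself is about PySpark, which this does not cover):

```python
import pandas as pd

df1 = pd.DataFrame({'wpk': [1, 1, 2], 'ipk': [2, 3, 1], 'num': [23.4, 45.5, 0.0]})
df2 = pd.DataFrame({'wpk': [1, 1, 2], 'ipk': [1, 3, 1], 'num': [12.0, 40.0, 50.0]})

# Stack both frames, then keep the largest num per (wpk, ipk) pair.
# Pairs that exist in only one frame survive, which mimics an outer join.
df = (pd.concat([df1, df2])
        .groupby(['wpk', 'ipk'], as_index=False)['num']
        .max())
print(df)
```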
