Joining two DataFrames with a condition in Python

I have two DataFrames.
df1 has over 2 million rows and complete data. I'd like to join data from df2, which has over 70,000 rows, but its structure is a bit complicated.
df1 has, for each row, the keys KO-STA and KO-PAR.
df2 has in some cases data only on KO-STA, in some cases only on KO-PAR, and in some cases on both.
I'd like to merge these two DataFrames and get the data on Need1 and Need2.
df1
STA_SID DST_SID CC KO_SIFKO KO-STA KO-PAR
135 10021582 28878502 NaN 634 634-83 537-780/9
117 10028732 29999540 NaN 657 657-1729 537-780/4
117 10028732 29999541 NaN 657 657-1729 537-780/4
117 10028732 29999542 NaN 657 657-1729 537-780/4
117 10028732 29999543 NaN 657 657-1729 537-780/4
117 10028732 31356572 NaN 657 657-1729 537-780/4
df2
KO-STA STA-PAR KO-PAR Need1 Need2
0 1976-_ 366/2 1976-366/2 Bio 49.500000
1 991-_ 329/128 991-329/128 PH 184.399994
2 2147--- 96/19 2147-96/19 Win 8.850000
3 2048-_ 625/4 2048-625/4 SSE 4.940000
4 2194-_ 285/3 2194-285/3 TI f 163.000000
5 2386--- 97/1 2386-97/1 Bio 49.500000
6 2002-_ 2002/9 2002-2002/9 Win 12.850000
7 1324-_ 62 1324-62 Win 8.850000
8 1625-_ 980/1 1625-980/1 Win 8.850000
9 1625-_ 980/1 1625-980/1 Bio 49.500000
My attempt was the following code:
GURS_ES1 = pd.merge(df1.reset_index(), df2.reset_index(), on = 'KO-STA')
GURS_ES2 = pd.merge(GURS_ES1.reset_index(), df2.reset_index(), on = 'KO-PAR')
But after the first merge, GURS_ES1 has two columns, KO-PAR_x and KO-PAR_y, and it doesn't join them as one column. Any recommendations?

Here is an example to show how you can proceed and to explain the reason for the behaviour you observed.
First, let's construct our sample data:
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randint(1,3,size=(3,3)), columns=['a1','x1','x2'])
Output
a1 x1 x2
0 1 2 1
1 2 1 1
2 1 2 2
Now, the other dataframe
df2 = pd.DataFrame(np.random.randint(1,3,size=(3,3)),columns=['a2','x1','x2'])
a2 x1 x2
0 2 2 1
1 1 2 2
2 1 1 2
Now, if we merge on only(!) one of the columns that occur in both dataframes, pandas suffixes the remaining shared columns so that you can still reconstruct from which dataframe each of them originally came:
pd.merge(df1,df2, on='x1')
Output
a1 x1 x2_x a2 x2_y
0 1 2 1 2 1
1 1 2 1 1 2
2 1 2 2 2 1
3 1 2 2 1 2
4 2 1 1 1 2
Now, the easiest way to get rid of this is to drop one of the duplicated columns from one of the dataframes:
pd.merge(df1[df1.columns.drop('x2')], df2, on='x1')
Output
a1 x1 a2 x2
0 1 2 2 1
1 1 2 1 2
2 1 2 2 1
3 1 2 1 2
4 2 1 1 2
But you could also merge on a list of columns. Note that we perform an inner join here, which can reduce the number of rows in the output dataframe significantly (or even lead to an empty dataframe if no rows match on both columns):
pd.merge(df1,df2, on=['x1','x2'])
a1 x1 x2 a2
0 1 2 1 2
1 1 2 2 1
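Applied to the original question, one possible approach is the following sketch, assuming the column names KO-STA, KO-PAR, Need1, and Need2 as posted, and that each key matches at most one row of df2 (the sample suggests df2 can contain duplicate keys, so you may want to drop_duplicates first):
# merge once on KO-STA and once on KO-PAR, keeping only the needed columns of df2
need_cols = ['Need1', 'Need2']
m_sta = df1.merge(df2[['KO-STA'] + need_cols], on='KO-STA', how='left')
m_par = df1.merge(df2[['KO-PAR'] + need_cols], on='KO-PAR', how='left')
# prefer the KO-STA match and fall back to the KO-PAR match where it is missing;
# both left merges preserve df1's row order, so positional assignment is safe here
for col in need_cols:
    df1[col] = m_sta[col].combine_first(m_par[col]).to_numpy()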

Related

I want to add sub-index in python with pandas [duplicate]

When using groupby(), how can I create a DataFrame with a new column containing an index of the group number, similar to dplyr::group_indices in R? For example, if I have
>>> df=pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
>>> df
a b
0 1 1
1 1 1
2 1 2
3 2 1
4 2 1
5 2 2
How can I get a DataFrame like
a b idx
0 1 1 1
1 1 1 1
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 4
(the order of the idx indexes doesn't matter)
Here is the solution using ngroup (available as of pandas 0.20.2) from a comment above by Constantino.
import pandas as pd
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
df['idx'] = df.groupby(['a', 'b']).ngroup()
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Here's a concise way using drop_duplicates and merge to get a unique identifier.
group_vars = ['a','b']
df.merge( df.drop_duplicates( group_vars ).reset_index(), on=group_vars )
a b index
0 1 1 0
1 1 1 0
2 1 2 2
3 2 1 3
4 2 1 3
5 2 2 5
The identifier in this case goes 0,2,3,5 (just a residual of original index) but this could be easily changed to 0,1,2,3 with an additional reset_index(drop=True).
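That variant might look like this (a sketch using the same group_vars as above):
df.merge(df.drop_duplicates(group_vars).reset_index(drop=True).reset_index(), on=group_vars)
Here the inner reset_index(drop=True) renumbers the de-duplicated rows 0,1,2,3 before they are exposed as the index column.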
Update: Newer versions of pandas (0.20.2+) offer a simpler way to do this with the ngroup method, as noted in a comment on the question by @Constantino and in a subsequent answer by @CalumYou. I'll leave this here as an alternate approach, but ngroup seems like the better way to do this in most cases.
A simple way to do that would be to concatenate your grouping columns (so that each combination of their values represents a uniquely distinct element), then convert it to a pandas Categorical and keep only its labels:
df['idx'] = pd.Categorical(df['a'].astype(str) + '_' + df['b'].astype(str)).codes
df
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
Edit: changed the labels property to codes as the former seems to be deprecated
Edit2: Added a separator as suggested by Authman Apatira
Definitely not the most straightforward solution, but here is what I would do (comments in the code):
df = pd.DataFrame({'a':[1,1,1,2,2,2],'b':[1,1,2,1,1,2]})
# create a dummy grouper id by just joining the desired columns as strings
df["idx"] = df[["a","b"]].astype(str).apply(lambda x: "".join(x), axis=1)
print(df)
That would generate a unique idx for each combination of a and b.
a b idx
0 1 1 11
1 1 1 11
2 1 2 12
3 2 1 21
4 2 1 21
5 2 2 22
But this is still a rather silly index (think about some more complex values in columns a and b). So let's clean up the index:
# create a dictionary of dummy group_ids and their index-wise representation
dict_idx = dict(enumerate(set(df["idx"])))
# switch keys and values, so you can use the dict in the .replace method
dict_idx = {y: x for x, y in dict_idx.items()}
# replace values with the generated dict
df["idx"].replace(dict_idx, inplace=True)
print(df)
That would produce the desired output:
a b idx
0 1 1 0
1 1 1 0
2 1 2 1
3 2 1 2
4 2 1 2
5 2 2 3
A way that I believe is faster than the current accepted answer by about an order of magnitude (timing results below):
def create_index_usingduplicated(df, grouping_cols=['a', 'b']):
    df.sort_values(grouping_cols, inplace=True)
    # You could do the following three lines in one, I just thought
    # this would be clearer as an explanation of what's going on:
    duplicated = df.duplicated(subset=grouping_cols, keep='first')
    new_group = ~duplicated
    return new_group.cumsum()
Timing results:
a = np.random.randint(0, 1000, size=int(1e5))
b = np.random.randint(0, 1000, size=int(1e5))
df = pd.DataFrame({'a': a, 'b': b})
In [6]: %timeit df['idx'] = pd.Categorical(df['a'].astype(str) + df['b'].astype(str)).codes
1 loop, best of 3: 375 ms per loop
In [7]: %timeit df['idx'] = create_index_usingduplicated(df, grouping_cols=['a', 'b'])
100 loops, best of 3: 17.7 ms per loop
I'm not sure this is such a trivial problem. Here is a somewhat convoluted solution that first sorts by the grouping columns and then checks whether each row is different from the previous row, incrementing the count by 1 whenever it is. Check further below for an answer with string data.
df.sort_values(['a', 'b']).diff().fillna(0).ne(0).any(1).cumsum().add(1)
Output
0 1
1 1
2 2
3 3
4 3
5 4
dtype: int64
So breaking this up into steps, let's see the output of df.sort_values(['a', 'b']).diff().fillna(0), which checks whether each row is different from the previous row. Any non-zero entry indicates a new group.
a b
0 0.0 0.0
1 0.0 0.0
2 0.0 1.0
3 1.0 -1.0
4 0.0 0.0
5 0.0 1.0
A new group only needs a single column to differ, which is what .ne(0).any(1) checks: not equal to 0 for any of the columns. Then a cumulative sum keeps track of the groups.
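Broken into named steps with comments, the same computation reads as follows (a sketch; the behaviour is identical to the one-liner above):
sorted_df = df.sort_values(['a', 'b'])
changed = sorted_df.diff().fillna(0).ne(0)  # True where a value differs from the previous row
new_group = changed.any(axis=1)             # a row starts a new group if any column changed
group_id = new_group.cumsum().add(1)        # running count of group starts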
Answer for columns as strings
#create fake data and sort it
df=pd.DataFrame({'a':list('aabbaccdc'),'b':list('aabaacddd')})
df1 = df.sort_values(['a', 'b'])
output of df1
a b
0 a a
1 a a
4 a a
3 b a
2 b b
5 c c
6 c d
8 c d
7 d d
Take a similar approach by checking whether the group has changed:
df1.ne(df1.shift().bfill()).any(1).cumsum().add(1)
0 1
1 1
4 1
3 2
2 3
5 4
6 5
8 5
7 6
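To attach these labels to the original frame, you can rely on pandas' index alignment when assigning (a sketch continuing the string example above):
# the result keeps df1's (sorted) index, so assigning it aligns each label
# back to the corresponding row of the original df
df['idx'] = df1.ne(df1.shift().bfill()).any(axis=1).cumsum().add(1)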

Loop and merge/update data

I have two data frames which look like:
DF1:
x_id y_id
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
DF2:
x_id y_id
1 1
2 1
3 1
4 2
5 2
6 2
1 3
3 3
: :
: :
3 y(i)
So, I want to merge/insert y_id from DF2 into y_id in DF1 in each iteration of the loop.
What I have so far:
count = df2['y_id'].unique()
for i in count:
    new_df = df1.merge(df2[df2['y_id'] == i], how='inner', left_on='x_id', right_on='x_id')
While this creates a new dataframe for each iteration of the loop, I think there should be a better way of doing this.
I want my final data frame to look like:
DF3:
x_id y_id
1 3
2 1
3 y(i)
4 2
5 2
6 2
Essentially what I want to do is group DF2 by y_id and merge them in sorted order. We can see in DF2 that the values 1 and 3 have y_id = 1 and then, further down the column, they have y_id = 3. Since 3 > 1, I would like to use that value (i.e. the greatest, or the most recent if we were working with dates, etc.).
What I want to do is similar to an update statement in SQL where we update the column and set the row = y_id, taking the most recent value.
Hope I have explained sufficiently, any questions just ask.
Thanks
You can drop_duplicates before the merge:
df1 = df1.drop(columns='y_id').merge(df2.drop_duplicates('x_id', keep='last'), on='x_id')
df1
Out[469]:
x_id y_id
0 1 3
1 2 1
2 3 3
3 4 2
4 5 2
5 6 2
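If the greatest y_id is what matters rather than the last one seen (the two coincide here because DF2 is already ordered), a sketch of an alternative is to take the maximum y_id per x_id before merging:
df1 = df1.drop(columns='y_id').merge(
    df2.groupby('x_id', as_index=False)['y_id'].max(), on='x_id')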

Pandas combine 2 Dataframes and overwrite values

I've looked into pandas join, merge, concat with different param values (how to join, indexing, axis=1, etc) but nothing solves it!
I have two dataframes:
x = pd.DataFrame(np.random.randn(4,4))
y = pd.DataFrame(np.random.randn(4,4),columns=list(range(2,6)))
x
Out[67]:
0 1 2 3
0 -0.036327 -0.594224 0.469633 -0.649221
1 1.891510 0.164184 -0.010760 -0.848515
2 -0.383299 1.416787 0.719434 0.025509
3 0.097420 -0.868072 -0.591106 -0.672628
y
Out[68]:
2 3 4 5
0 -0.328402 -0.001436 -1.339613 -0.721508
1 0.408685 1.986148 0.176883 0.146694
2 -0.638341 0.018629 -0.319985 -1.832628
3 0.125003 1.134909 0.500017 0.319324
I'd like to combine them into one dataframe where the values from y in columns 2 and 3 overwrite those of x, and then columns 4 and 5 are appended at the end:
new
Out[100]:
0 1 2 3 4 5
0 -0.036327 -0.594224 -0.328402 -0.001436 -1.339613 -0.721508
1 1.891510 0.164184 0.408685 1.986148 0.176883 0.146694
2 -0.383299 1.416787 -0.638341 0.018629 -0.319985 -1.832628
3 0.097420 -0.868072 0.125003 1.134909 0.500017 0.319324
You can try combine_first:
df = y.combine_first(x)
You can use update together with combine_first: x.update(y) overwrites the overlapping columns 2 and 3 in place, and x.combine_first(y) then appends columns 4 and 5, which x does not have.
x.update(y)
x.combine_first(y)
Out[1417]:
0 1 2 3 4 5
0 -1.075266 1.044069 -0.423888 0.247130 0.008867 2.058995
1 0.122782 -0.444159 1.528181 0.595939 0.155170 1.693578
2 -0.825819 0.395140 -0.171900 -0.161182 -2.016067 0.223774
3 -0.009081 -0.148430 -0.028605 0.092074 1.355105 -0.003027
Or use pd.concat plus the column intersection:
pd.concat([x.drop(x.columns.intersection(y.columns), axis=1), y], axis=1)
Out[1432]:
0 1 2 3 4 5
0 -1.075266 1.044069 -0.423888 0.247130 0.008867 2.058995
1 0.122782 -0.444159 1.528181 0.595939 0.155170 1.693578
2 -0.825819 0.395140 -0.171900 -0.161182 -2.016067 0.223774
3 -0.009081 -0.148430 -0.028605 0.092074 1.355105 -0.003027

Python / Pandas: Eliminate for Loop using 2 DataFrames

First time asking a question here, so hopefully I will make my issue clear. I am trying to understand how to better apply a list of scenarios (via a for loop) to the same dataset and summarize the results. Note that once a scenario is applied and I pull the relevant statistical data from the dataframe into the summary table, I do not need to retain the information. iterrows is painfully slow, as I have tens of thousands of scenarios I want to run. Thank you for taking the time to review.
I have two Pandas dataframes: df_analysts and df_results:
1) df_analysts contains a specific list of factors (e.g. TB, JK, SF, PWR) and scenarios of weights (e.g. 50, 50, 50, 50):
TB JK SF PWR
0 50 50 50 50
1 50 50 50 100
2 50 50 50 150
3 50 50 50 200
4 50 50 50 250
2) df_results holds results by date, group, and entrant, then the ranking by each factor, and finally the finish result:
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W)
0 11182017 1 1 2 1 2 1 2
1 11182017 1 2 3 2 3 2 1
2 11182017 1 3 1 3 1 3 3
3 11182017 2 1 1 2 2 1 1
4 11182017 2 2 2 1 1 2 1
3) I am using iterrows to:
loop through each scenario in the df_analysts dataframe
apply the weight scenario to each factor rank (if rank = 1, then 1.0*weight; rank = 2, then 0.68*weight; rank = 3, then 0.32*weight). Those results go into the W1-W4 columns.
sum the W1-W4 columns.
rank the SUM(W) column.
Result sample below for a single scenario (e.g. 50,50,50,50)
Date GR Ent TB-R JK-R SF-R PWR-R Fin W1 W2 W3 W4 SUM(W) Rank
0 11182017 1 1 2 1 2 1 1 34 50 34 50 168 1
1 11182017 1 2 3 2 3 2 3 16 34 16 34 100 3
2 11182017 1 3 1 3 1 3 2 50 16 50 16 132 2
3 11182017 2 1 2 2 2 1 1 34 34 34 50 152 2
4 11182017 2 2 1 1 1 2 1 50 50 50 34 184 1
4) Finally, for each scenario, I am creating a new dataframe for the summary results (df_summary), which logs the factor/weight scenario used (from df_analysts), compares the Rank result to the Finish by date and group, and keeps a tally of where they land. A sample is below (only the 50,50,50,50 scenario is shown above, which results in a 1,1).
Factors Weights Top Top2
0 (TB,JK,SF,PWR) (50,50,50,50) 1 1
1 (TB,JK,SF,PWR) (50,50,50,100) 1 0
2 (TB,JK,SF,PWR) (50,50,50,150) 1 1
3 (TB,JK,SF,PWR) (50,50,50,200) 1 0
4 (TB,JK,SF,PWR) (50,50,50,250) 1 1
You could merge your analysts and results dataframes and then perform the calculations:
def factor_rank(x, y):
    # map a rank of 1/2/3 to 100%, 68% and 32% of the weight respectively
    if x == 1: return y
    elif x == 2: return y * 0.68
    elif x == 3: return y * 0.32

df_analysts.index.name = 'SCENARIO'
df_analysts.reset_index(inplace=True)

# cross join: pair every scenario with every result row via a constant key
df_analysts['key'] = 1
df_results['key'] = 1
df = pd.merge(df_analysts, df_results, on='key')
df.drop(['key'], axis=1, inplace=True)

df['W1'] = df.apply(lambda r: factor_rank(r['TB-R'], r['TB']), axis=1)
df['W2'] = df.apply(lambda r: factor_rank(r['JK-R'], r['JK']), axis=1)
df['W3'] = df.apply(lambda r: factor_rank(r['SF-R'], r['SF']), axis=1)
df['W4'] = df.apply(lambda r: factor_rank(r['PWR-R'], r['PWR']), axis=1)

df['SUM(W)'] = df.W1 + df.W2 + df.W3 + df.W4
df['rank'] = df.groupby(['GR', 'SCENARIO'])['SUM(W)'].rank(ascending=False)
You may also want to check out this question which deals with improving processing times on row based calculations:
How to apply a function to multiple columns of a pandas DataFrame in parallel
