Pandas left join with duplicates - python

I have a pandas data frame like
A = pd.DataFrame({'Name' : ['A', 'A','A'], 'Value' : [1,2,3]})
and another DataFrame
B = pd.DataFrame({'Name': ['A', 'C', 'D', 'A', 'E', 'A'], 'Value1' :[1,2,3,4,5,6]})
When I merge these I get:
In [4]: A.merge(B, how='left', on='Name')
Out[4]:
  Name  Value  Value1
0    A      1       1
1    A      1       4
2    A      1       6
3    A      2       1
4    A      2       4
5    A      2       6
6    A      3       1
7    A      3       4
8    A      3       6
Is there any way to do this merge such that the first row with 'A' in A matches only the first row with 'A' in B, the second with the second, and the third with the third?
The final output should look like:
  Name  Value  Value1
0    A      1       1
1    A      2       4
2    A      3       6
Thanks,
I tried doing a left merge. I wasn't expecting anything different, but I am looking for a better way to do this.
Doing an inner join doesn't help either:
A.merge(B, how='inner', on='Name')
  Name  Value  Value1
0    A      1       1
1    A      1       4
2    A      1       6
3    A      2       1
4    A      2       4
5    A      2       6
6    A      3       1
7    A      3       4
8    A      3       6

Deduplicate with groupby.cumcount and pass the counter to merge as a secondary key:
A.merge(B, how='left',
        left_on=['Name', A.groupby('Name').cumcount()],
        right_on=['Name', B.groupby('Name').cumcount()]
        )#.drop(columns='key_1')
Output:
  Name  key_1  Value  Value1
0    A      0      1       1
1    A      1      2       4
2    A      2      3       6
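For reference, a self-contained version of this answer; the helper column name key_1 is generated by merge itself for the array-valued second key.

```python
import pandas as pd

# Frames from the question
A = pd.DataFrame({'Name': ['A', 'A', 'A'], 'Value': [1, 2, 3]})
B = pd.DataFrame({'Name': ['A', 'C', 'D', 'A', 'E', 'A'],
                  'Value1': [1, 2, 3, 4, 5, 6]})

# The occurrence counter pairs the n-th 'A' in A with the n-th 'A' in B;
# it survives the merge as 'key_1', so drop it afterwards
out = A.merge(B, how='left',
              left_on=['Name', A.groupby('Name').cumcount()],
              right_on=['Name', B.groupby('Name').cumcount()]
              ).drop(columns='key_1')
```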

You are requesting something that is not actually a join.
You can do something like this however:
pd.concat([A, B[B.Name == "A"].reset_index().Value1], axis=1)
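A runnable version of this concat approach. Note that it relies on A having exactly as many rows as B has 'A' rows: positions, not keys, do the matching.

```python
import pandas as pd

A = pd.DataFrame({'Name': ['A', 'A', 'A'], 'Value': [1, 2, 3]})
B = pd.DataFrame({'Name': ['A', 'C', 'D', 'A', 'E', 'A'],
                  'Value1': [1, 2, 3, 4, 5, 6]})

# Select B's 'A' rows, realign their index to 0..n-1, and paste the
# Value1 column next to A positionally
out = pd.concat([A, B[B.Name == 'A'].reset_index().Value1], axis=1)
```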

Related

Pandas replace columns by merging another dataframe

I have a dataframe df1 that looks like this:
   id   A  B
0   1  10  5
1   1  11  6
2   2  10  7
3   2  11  8
And another dataframe df2:
   id  A
0   1  3
1   2  4
Now I want to replace the A column in df1 with the values of A in df2, matched on id, so the result should look like this:
   id  A  B
0   1  3  5
1   1  3  6
2   2  4  7
3   2  4  8
One way is to drop column A in df1 first and then merge df2 into df1 on id, like df1 = df1.drop(['A'], axis=1).merge(df2, how='left', on='id'), but if df2 has, say, 10 columns, that gets unwieldy. Is there a more elegant way to do this?
Here is one way to do it, using DataFrame.update. However, it requires setting id as the index so the two DataFrames can be matched:
df1.set_index('id', inplace=True)
df2.set_index('id', inplace=True)
df1.update(df2)
df1['A'] = df1['A'].astype(int)  # update casts the column to float by default
df1.reset_index()
   id  A  B
0   1  3  5
1   1  3  6
2   2  4  7
3   2  4  8
Merge just the id column of df1 with df2, then combine_first the result onto the original DataFrame:
df1 = df1[['id']].merge(df2).combine_first(df1)
print(df1)
Output:
   A  B  id
0  3  5  1
1  3  6  1
2  4  7  2
3  4  8  2
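A runnable sketch of the combine_first approach with the frames from the question. Note that combine_first takes the column union in sorted order, hence the A B id column order in the output above.

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 1, 2, 2],
                    'A': [10, 11, 10, 11],
                    'B': [5, 6, 7, 8]})
df2 = pd.DataFrame({'id': [1, 2], 'A': [3, 4]})

# Merge only the id column against df2 so every row picks up df2's A,
# then let combine_first fill the remaining columns (here B) from df1
out = df1[['id']].merge(df2).combine_first(df1)
```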

How to remove NaNs and squeeze in a DataFrame - pandas

I was doing some coding and realized something: I think there is an easier way of doing this.
So I have a DataFrame like this:
>>> df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})
   a  b
0  1  A
1  A  3
2  2  A
3  A  4
I want to remove all of the 'A's from the data, but I also want to squeeze the DataFrame. By squeezing I mean getting this result:
   a  b
0  1  3
1  2  4
I have a solution as follows:
a = df['a'][df['a'] != 'A']
b = df['b'][df['b'] != 'A']
df2 = pd.DataFrame({'a': a.tolist(), 'b': b.tolist()})
print(df2)
This works, but I keep thinking there must be an easier way; I've stopped coding for a while, so I'm a bit rusty...
Note:
All columns contain the same number of 'A's, so there is no problem there.
You can try boolean indexing with loc to remove the A values:
pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
Result:
   a  b
0  1  3
1  2  4
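Put together, the dict-comprehension approach runs as:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})

# For each column, keep only the non-'A' values and rebuild the frame
# from plain lists, which re-packs the survivors from row 0 down
out = pd.DataFrame({c: df.loc[df[c] != 'A', c].tolist() for c in df})
```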
This would do:
In [1513]: df.replace('A', np.nan).apply(lambda x: pd.Series(x.dropna().to_numpy()))
Out[1513]:
     a    b
0  1.0  3.0
1  2.0  4.0
We can use df.melt, then filter out the 'A' values, then df.pivot:
out = df.melt().query("value!='A'")
out.index = out.groupby('variable')['variable'].cumcount()
out.pivot(columns='variable', values='value').rename_axis(columns=None)
   a  b
0  1  3
1  2  4
Details
out = df.melt().query("value!='A'")
  variable value
0        a     1
2        a     2
5        b     3
7        b     4
# We set this as the index, since it helps in `df.pivot`
out.groupby('variable')['variable'].cumcount()
0    0
2    1
5    0
7    1
dtype: int64
out.pivot(columns='variable', values='value').rename_axis(columns=None)
   a  b
0  1  3
1  2  4
Another alternative:
df = df.mask(df.eq('A'))
out = df.stack()
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
   a  b
0  1  3
1  2  4
Details
df = df.mask(df.eq('A'))
     a    b
0    1  NaN
1  NaN    3
2    2  NaN
3  NaN    4
out = df.stack()
0  a    1
1  b    3
2  a    2
3  b    4
dtype: object
pd.DataFrame(out.groupby(level=1).agg(list).to_dict())
   a  b
0  1  3
1  2  4
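A runnable form of this mask/stack alternative; the explicit .dropna() is an added safeguard, since newer pandas stack implementations no longer drop NaNs by default.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 'A', 2, 'A'], 'b': ['A', 3, 'A', 4]})

# Mask the 'A's to NaN, drop them via stack, then regroup the surviving
# values by their column label (level 1 of the stacked MultiIndex)
masked = df.mask(df.eq('A'))
out = pd.DataFrame(masked.stack().dropna()
                   .groupby(level=1).agg(list).to_dict())
```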

Pandas dataframe merge not working as expected with multiple column equality checks

I am trying to merge two DataFrames based on two columns being equal.
Here is the code:
>>> df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'))
   Name  Age
0     1    2
1     3    4
2     4    5
>>> df
   Name  Age
0     1    2
1     3    4
0     4    5
>>> df1
   Name  Age
0     5    6
1     3    4
0     4    7
What I actually expected from the merge was:
   Name  Age  Age_#
0     1    2    NaN
1     3    4    4.0
2     4    5    7.0
Why does pandas think that there are three matching rows for this merge?
So you mean merge on Name, right?
df.merge(df1, how='left', on='Name', suffixes=('', '_#'))
Out[120]:
   Name  Age  Age_#
0     1    2    NaN
1     3    4    4.0
2     4    5    7.0
Use indicator to see what your output really is:
df.merge(df1, how='left', left_on=['Name', 'Age'], right_on=['Name', 'Age'], suffixes=('', '_#'), indicator=True)
Out[121]:
   Name  Age     _merge
0     1    2  left_only
1     3    4       both
2     4    5  left_only
Since df and df1 have the same columns and all of them were used as merge keys, there are no other columns left over to show whether a row found a match in df1. And because you used a left join, the default is to keep every left-hand row in the result.
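To make the two behaviors concrete, a runnable sketch (using a default integer index for simplicity, which does not affect the merge):

```python
import pandas as pd

df = pd.DataFrame({'Name': [1, 3, 4], 'Age': [2, 4, 5]})
df1 = pd.DataFrame({'Name': [5, 3, 4], 'Age': [6, 4, 7]})

# Merging on Name alone keeps both Age columns, disambiguated by suffix
out = df.merge(df1, how='left', on='Name', suffixes=('', '_#'))

# indicator=True reveals which rows actually matched when both columns are keys
diag = df.merge(df1, how='left', on=['Name', 'Age'], indicator=True)
```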

Loop over groups Pandas Dataframe and get sum/count

I am using Pandas to structure and process Data.
This is my DataFrame:
And this is the code which enabled me to get this DataFrame:
(data[['time_bucket', 'beginning_time', 'bitrate', 2, 3]].groupby(['time_bucket', 'beginning_time', 2, 3])).aggregate(np.mean)
Now I want the sum (ideally, the sum and the count) of my 'bitrate' values grouped by time_bucket. For example, for the first time_bucket (2016-07-08 02:00:00, 2016-07-08 02:05:00), the sum should be 93750000 and the count 25, over all the 'bitrate' cases.
I did this :
data[['time_bucket', 'bitrate']].groupby(['time_bucket']).agg(['sum', 'count'])
And this is the result:
But I really want to have all my data in one DataFrame.
Can I do a simple loop over 'time_bucket' and apply a function that calculates the sum of all bitrates?
Any ideas? Thx!
I think you need merge, but it requires the same index levels in both DataFrames, so use reset_index first; afterwards, restore the original MultiIndex with set_index:
data = pd.DataFrame({'A':[1,1,1,1,1,1],
                     'B':[4,4,4,5,5,5],
                     'C':[3,3,3,1,1,1],
                     'D':[1,3,1,3,1,3],
                     'E':[5,3,6,5,7,1]})
print (data)
   A  B  C  D  E
0  1  4  3  1  5
1  1  4  3  3  3
2  1  4  3  1  6
3  1  5  1  3  5
4  1  5  1  1  7
5  1  5  1  3  1
df1 = data[['A', 'B', 'C', 'D','E']].groupby(['A', 'B', 'C', 'D']).aggregate(np.mean)
print (df1)
           E
A B C D
1 4 3 1  5.5
      3  3.0
  5 1 1  7.0
      3  3.0
df2 = data[['A', 'C']].groupby(['A'])['C'].agg(['sum', 'count'])
print (df2)
   sum  count
A
1   12      6
print (pd.merge(df1.reset_index(['B','C','D']), df2, left_index=True, right_index=True)
         .set_index(['B','C','D'], append=True))
           E  sum  count
A B C D
1 4 3 1  5.5   12      6
      3  3.0   12      6
  5 1 1  7.0   12      6
      3  3.0   12      6
I tried another solution that derives the output from df1 alone, but df1 is already aggregated, so it is impossible to recover the right numbers from it. If you sum level C, you get 8 instead of 12.
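The steps above in a compact runnable form; passing 'mean' as a string rather than np.mean sidesteps a deprecation warning in newer pandas.

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 1, 1, 1, 1, 1],
                     'B': [4, 4, 4, 5, 5, 5],
                     'C': [3, 3, 3, 1, 1, 1],
                     'D': [1, 3, 1, 3, 1, 3],
                     'E': [5, 3, 6, 5, 7, 1]})

# Per-group means on the full key, and sum/count at the coarser 'A' level
df1 = data.groupby(['A', 'B', 'C', 'D']).aggregate('mean')
df2 = data.groupby('A')['C'].agg(['sum', 'count'])

# Drop the extra levels so both frames share the 'A' index, merge on it,
# then restore the original MultiIndex
out = (pd.merge(df1.reset_index(['B', 'C', 'D']), df2,
                left_index=True, right_index=True)
       .set_index(['B', 'C', 'D'], append=True))
```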

How to extract values of one dataframe using values of another dataframe in pandas?

Suppose you create the following pandas DataFrames:
In[1]: print df1.to_string()
   ID value
0   1     a
1   2     b
2   3     c
3   4     d
In[2]: print df2.to_string()
   Id_a  Id_b
0     1     2
1     4     2
2     2     1
3     3     3
4     4     4
5     2     2
How can I create a DataFrame df_ids_to_values with the following values:
In[3]: print df_ids_to_values.to_string()
  value_a value_b
0       a       b
1       d       b
2       b       a
3       c       c
4       d       d
5       b       b
In other words, I would like to replace the ids in df2 with the corresponding values from df1. I tried doing this with a for loop, but it is very slow, and I am hoping there is a pandas function that lets me do this operation very efficiently.
Thanks for your help...
Start by setting an index on df1
df1 = df1.set_index('ID')
then join the two columns
df = df2.join(df1, on='Id_a')
df = df.rename(columns = {'value' : 'value_a'})
df = df.join(df1, on='Id_b')
df = df.rename(columns = {'value' : 'value_b'})
result:
> df
   Id_a  Id_b value_a value_b
0     1     2       a       b
1     4     2       d       b
2     2     1       b       a
3     3     3       c       c
4     4     4       d       d
5     2     2       b       b

[6 rows x 4 columns]
(and you get to your expected output with df[['value_a','value_b']])
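As an aside, Series.map against an ID-to-value lookup does the same thing without joins; this is an alternative sketch, not part of the answer above.

```python
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 2, 3, 4], 'value': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'Id_a': [1, 4, 2, 3, 4, 2],
                    'Id_b': [2, 2, 1, 3, 4, 2]})

# Build the ID -> value lookup once, then map each ID column through it
lookup = df1.set_index('ID')['value']
df_ids_to_values = pd.DataFrame({'value_a': df2['Id_a'].map(lookup),
                                 'value_b': df2['Id_b'].map(lookup)})
```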
