This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
I have two dataframes with different sizes and I want to merge them.
It's like an "update" to a dataframe column based on another dataframe with different size.
This is an example input:
dataframe 1
CODUSU Situação TIPO1
0 1AB P A0
1 2C3 C B1
2 3AB P C1
dataframe 2
CODUSU Situação ABC
0 1AB A 3
1 3AB A 4
My output should be like this:
dataframe 3
CODUSU Situação TIPO1
0 1AB A A0
1 2C3 C B1
2 3AB A C1
PS: I did it with a loop, but I think there should be a better and easier way to do it!
I read this content: pandas merging 101 and wrote this code:
df3=df1.merge(df2, on=['CODUSU'], how='left', indicator=False)
df3['Situação'] = np.where((df3['Situação_x'] == 'P') & (df3['Situação_y'] == 'A') , df3['Situação_y'] , df3['Situação_x'])
df3=df3.drop(columns=['Situação_x', 'Situação_y','ABC'])
df3 = df3[['CODUSU','Situação','TIPO1']]
And Voilà, df3 is exactly what I needed!
Thanks, everyone!
PS: I already found my answer, is there a better place to answer my own question?
df1.merge(df2,how='left', left_on='CODUSU', right_on='CODUSU')
This should do the trick.
Also worth noting: if you don't want the resulting dataframe to contain the column ABC, pass df2.drop(columns="ABC") instead of just df2. (A bare df2.drop("ABC") looks for a row labelled "ABC" and raises a KeyError.)
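As an alternative to merging and then cleaning up the suffixed columns, the "update" can be sketched as a lookup. This is only a minimal sketch, and it assumes that CODUSU is unique in df2 and that every match should overwrite the old value (the asker's own code only replaces 'P' with 'A'). The name `lookup` is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "CODUSU": ["1AB", "2C3", "3AB"],
    "Situação": ["P", "C", "P"],
    "TIPO1": ["A0", "B1", "C1"],
})
df2 = pd.DataFrame({
    "CODUSU": ["1AB", "3AB"],
    "Situação": ["A", "A"],
    "ABC": [3, 4],
})

# Hypothetical helper name: a Series mapping CODUSU -> new Situação.
# Assumption: CODUSU values are unique in df2.
lookup = df2.set_index("CODUSU")["Situação"]

df3 = df1.copy()
# Rows with no match in df2 come back as NaN from map(),
# so fillna() restores their original value.
df3["Situação"] = df3["CODUSU"].map(lookup).fillna(df3["Situação"])
print(df3)
```

Because only the one column is touched, there are no Situação_x / Situação_y columns to drop and ABC never enters the result.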
This question already has answers here:
Finding the difference between two dataframes having duplicate records
(2 answers)
Closed 1 year ago.
I have two Dataframes as below:
df1 =
emp_id emp_name e_city
1 Joe Acity
2 Nick Bcity
3 Sam Ccity
4 John Dcity
5 Mike Ecity
df2 =
emp_id emp_name e_city
2 Nick Bcity
2 Nick Bcity
3 Sam Ccity
4 John Dcity
Please note that df2 has a duplicate row and that the two DataFrames have different lengths.
My use case is to find the mismatches or differences between these two DataFrames.
Expected output: a row that occurs once in one DataFrame and twice in the other should be shown as a difference, along with the other mismatched rows.
df3 =
emp_id emp_name e_city
1 Joe Acity
2 Nick Bcity
5 Mike Ecity
I tried the methods below, but none of them worked.
I cannot use 'df.compare' since the two dataframes are not of equal length.
I tried using 'df.merge', but it does not flag the duplicated row as a mismatch/difference.
I tried to use 'concat' and 'compare'. That was not successful either.
Can someone please help me with this? Thanks in advance.
First, you can count the duplicates with groupby:
df2 = df2.groupby(['emp_id','emp_name']).size().reset_index(name='count')
Then merge the result with the original dataframe.
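The count-then-merge idea can be sketched end to end as follows. This is one possible reading of the answer, not necessarily what its author had in mind: it counts rows on both sides, outer-merges the counts, and keeps the rows whose counts differ. The names `cols`, `count_1` and `count_2` are made up for illustration, and it assumes none of the key columns contain NaN (groupby drops such rows by default).

```python
import pandas as pd

df1 = pd.DataFrame({
    "emp_id": [1, 2, 3, 4, 5],
    "emp_name": ["Joe", "Nick", "Sam", "John", "Mike"],
    "e_city": ["Acity", "Bcity", "Ccity", "Dcity", "Ecity"],
})
df2 = pd.DataFrame({
    "emp_id": [2, 2, 3, 4],
    "emp_name": ["Nick", "Nick", "Sam", "John"],
    "e_city": ["Bcity", "Bcity", "Ccity", "Dcity"],
})

cols = ["emp_id", "emp_name", "e_city"]

# How many times does each distinct row occur in each frame?
c1 = df1.groupby(cols).size().reset_index(name="count_1")
c2 = df2.groupby(cols).size().reset_index(name="count_2")

# The outer merge keeps rows present on either side;
# a missing count means the row occurs zero times there.
merged = c1.merge(c2, on=cols, how="outer").fillna({"count_1": 0, "count_2": 0})

# A row is a difference when the two counts disagree.
df3 = merged.loc[merged["count_1"] != merged["count_2"], cols].reset_index(drop=True)
print(df3)
```

Comparing counts instead of mere presence is what lets the doubled Nick row show up alongside the rows that exist in only one frame.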
This question already has answers here:
Pandas groupby with delimiter join
(2 answers)
Closed 2 years ago.
I have a dataframe with a column (A) with duplicate values. But the other column (B) has unique values for each value in (A).
A B
0 6SP 6A
1 6SP 6B
2 6FR 6A
I want to drop the duplicates in col (A) but still retain all values in col (B) by concatenation. The result should look like
A B
0 6SP 6A,6B
2 6FR 6A
Is this possible? The dataset is not very big (approx. 1000 rows) so efficiency is not very important.
Sample code:
df = pd.DataFrame({"A":['6SP','6SP','6FR'], "B":['6A','6B','6A']})
Best regards,
David
Try this:
df.groupby('A')['B'].apply(",".join)
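The expected output in the question keeps A as a regular column, uses a comma with no space, and lists 6SP before 6FR. A minimal sketch that matches those details, assuming B holds only strings (str.join fails on NaN or numbers), could look like this; `out` is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({"A": ["6SP", "6SP", "6FR"], "B": ["6A", "6B", "6A"]})

out = (
    df.groupby("A", sort=False)["B"]  # sort=False keeps first-appearance order
      .agg(",".join)                  # concatenate each group's B values
      .reset_index()                  # turn A back into a regular column
)
print(out)
```

Note that reset_index() renumbers the rows 0, 1, so the original row labels (0 and 2 in the question's desired output) are not preserved.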
This question already has answers here:
Pandas grouby and transform('count') gives placement error - works fine on smaller dataset
(1 answer)
Merging a pandas groupby result back into DataFrame
(3 answers)
Closed 4 years ago.
I was wondering if anyone knew of a better method than what I am currently doing. Here is an example data set:
ID Number
a 1
a 2
a 3
b 4
c 5
c 6
c 7
c 8
For example, to get a count of Number by ID in the table above, I would first group by ID and count Number, then merge the results back into the original table like so:
df2 = df.groupby('ID').agg({'Number':'count'}).reset_index()
df2 = df2.rename(columns = {'Number':'Number_Count'})
df = pd.merge(df, df2, on = ['ID'])
This results in:
It feels like a roundabout way of doing this. Does anyone know a better alternative? I ask because, when working with large data sets, this method can chew up a lot of memory (by creating another table and then merging the two).
You can do that quite simply with this:
import pandas as pd
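df = pd.DataFrame({'ID': list('aaabcccc'),
                   'Number': range(1,9)})
# Select the column first so that transform returns a Series;
# without it, the assignment fails once df has more than one non-key column.
df['Number_Count'] = df.groupby('ID')['Number'].transform('count')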
df
# ID Number Number_Count
#0 a 1 3
#1 a 2 3
#2 a 3 3
#3 b 4 1
#4 c 5 4
#5 c 6 4
#6 c 7 4
#7 c 8 4
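One detail worth keeping in mind with transform: 'count' counts non-null values, while 'size' counts rows. The small sketch below uses made-up data with one missing value to show the difference; the column names `n_non_null` and `n_rows` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"ID": list("aab"), "Number": [1.0, np.nan, 3.0]})

grouped = df.groupby("ID")["Number"]
df["n_non_null"] = grouped.transform("count")  # skips the NaN in group 'a'
df["n_rows"] = grouped.transform("size")       # counts every row in the group
print(df)
```

If Number never contains missing values, as in the question's data, the two give the same result.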
This question already has answers here:
How to remove nan value while combining two column in Panda Data frame?
(5 answers)
Closed 4 years ago.
I have a pretty simple Pandas question that deals with merging two series. I have two series in a dataframe together that are similar to this:
Column1 Column2
0 Abc NaN
1 NaN Abc
2 Abc NaN
3 NaN Abc
4 NaN Abc
The answer will probably end up being a really simple .merge() or .concat() command, but I'm trying to get a result like this:
Column1
0 Abc
1 Abc
2 Abc
3 Abc
4 Abc
The idea is that for each row, there is a string of data in either Column1 or Column2, but never both. I spent about 10 minutes looking for answers on StackOverflow as well as Google, but I couldn't find a similar question that cleanly applied to what I was trying to do.
I realize that a lot of this question just stems from my ignorance of the three functions that Pandas has to stick series and dataframes together. Any help is very much appreciated. Thank you!
You can just use pd.Series.fillna:
df['Column1'] = df['Column1'].fillna(df['Column2'])
Merge or concat are not appropriate here; they are used primarily for combining dataframes or series based on labels.
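For reference, here is a self-contained sketch of the same idea using combine_first, which takes values from the first series and falls back to the second wherever the first is null. The frame below is reconstructed from the question's example, and `merged` is an illustrative name.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Column1": ["Abc", np.nan, "Abc", np.nan, np.nan],
    "Column2": [np.nan, "Abc", np.nan, "Abc", "Abc"],
})

# Prefer Column1; where it is NaN, use the value from Column2.
merged = df["Column1"].combine_first(df["Column2"])
print(merged)
```

Both fillna and combine_first align on the row index, so they work the same way here because the two columns come from one dataframe.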
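Use groupby with first (groupby(..., axis=1) is deprecated in recent pandas versions, so this groups the transposed frame instead):
df.T.groupby(df.columns.str[:-1]).first().T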
Out[294]:
Column
0 Abc
1 Abc
2 Abc
3 Abc
4 Abc
Or:
ndf = pd.DataFrame({'Column1':df.fillna('').sum(1)})
This question already has answers here:
Concatenate rows of two dataframes in pandas
(3 answers)
Closed 5 years ago.
I have two Pandas DataFrames, each with different columns. I want to basically glue them together horizontally (they each have the same number of rows so this shouldn't be an issue).
There must be a simple way of doing this but I've gone through the docs and concat isn't what I'm looking for (I don't think).
Any ideas?
Thanks!
concat is indeed what you're looking for, you just have to pass it a different value for the "axis" argument than the default. Code sample below:
import pandas as pd
df1 = pd.DataFrame({
    'A': [1,2,3,4,5],
    'B': [1,2,3,4,5]
})
df2 = pd.DataFrame({
    'C': [1,2,3,4,5],
    'D': [1,2,3,4,5]
})
df_concat = pd.concat([df1, df2], axis=1)
print(df_concat)
With the result being:
A B C D
0 1 1 1 1
1 2 2 2 2
2 3 3 3 3
3 4 4 4 4
4 5 5 5 5
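One caveat: concat with axis=1 aligns rows by index label, not by position. The example above works because both frames use the default 0..4 index. If the two frames have different indexes (for instance after filtering), a sketch like the following resets them first; the frames `left` and `right` are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"A": [1, 2, 3]}, index=[0, 1, 2])
right = pd.DataFrame({"C": [10, 20, 30]}, index=[5, 6, 7])

# Labels do not overlap, so the result has 6 rows padded with NaN.
misaligned = pd.concat([left, right], axis=1)

# Dropping the old labels glues the frames together purely by position.
glued = pd.concat(
    [left.reset_index(drop=True), right.reset_index(drop=True)],
    axis=1,
)
print(misaligned.shape, glued.shape)
```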