Merging two dataframes on a column and an index - Python

Hi, I have two dataframes. The first one was created by grouping another df by id (which is now the index) and then sorting by the 'due' column.
df1:
paid due
id
3 13.000000 5.000000
2 437.000000 5.000000
5 90.000000 5.000000
1 60.000000 5.000000
4 675.000000 5.000000
The other one is a normal dataframe which has 3 columns: 'id' 'name' and 'country'.
df2:
id name country
1 'AB' 'DE'
2 'CD' 'DE'
3 'EF' 'NL'
4 'HAH' 'SG'
5 'NOP' 'NOR'
So what I was trying to do is to add the 'name' column to the 1st dataframe based on the id number (which is index in first df and column in second one).
So I thought this code would work:
pd.merge(df1, df2['name'], left_index=True, right_on='id')
But I get this error:
ValueError: can not merge DataFrame with instance of type <class 'pandas.core.series.Series'>

You can use rename to map the index values of df1 through a Series of names indexed by id:
df1['name'] = df1.rename(index=df2.set_index('id')['name']).index
print (df1)
paid due name
id
3 13.0 5.0 'EF'
2 437.0 5.0 'CD'
5 90.0 5.0 'NOP'
1 60.0 5.0 'AB'
4 675.0 5.0 'HAH'

You might find that pd.concat is a better option here because it can accept a mix of dataframe and series: http://pandas.pydata.org/pandas-docs/stable/merging.html#concatenating-with-mixed-ndims.
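A minimal sketch of that idea, assuming the df1 and df2 from the question (rebuilt here by hand). pd.concat with axis=1 aligns on the index, so the 'name' Series has to be indexed by id first; df2['name'] on its own carries the default 0..4 index and would not line up with df1.

```python
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame(
    {"paid": [13.0, 437.0, 90.0, 60.0, 675.0], "due": [5.0] * 5},
    index=pd.Index([3, 2, 5, 1, 4], name="id"),
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["AB", "CD", "EF", "HAH", "NOP"],
        "country": ["DE", "DE", "NL", "SG", "NOR"],
    }
)

# Index the names by id so that concat can align them with df1's index.
names = df2.set_index("id")["name"]

# axis=1 places the Series next to df1's columns, matching rows by index label.
combined = pd.concat([df1, names], axis=1)
print(combined)
```

Since concat performs an outer join on the index by default, ids that appear in only one of the two inputs would also show up in the result, with NaN in the missing columns.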

Okay, so I figured out that I can't pass a single column of df2 (a Series) that way, but I can reduce df2 so that it contains only the columns I need:
df2=df2[['id', 'name']]
pd.merge(df1, df2, left_index=True, right_on='id')
And there is no error anymore.
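Another option, sketched here with the example data from the question, is DataFrame.join. It joins on the index of the left frame by default, so moving 'id' into the index of df2 is enough, and df1 keeps its own index and row order.

```python
import pandas as pd

# Rebuild the example frames from the question.
df1 = pd.DataFrame(
    {"paid": [13.0, 437.0, 90.0, 60.0, 675.0], "due": [5.0] * 5},
    index=pd.Index([3, 2, 5, 1, 4], name="id"),
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["AB", "CD", "EF", "HAH", "NOP"],
        "country": ["DE", "DE", "NL", "SG", "NOR"],
    }
)

# Left join on the index: only the 'name' column of df2 is brought over.
out = df1.join(df2.set_index("id")["name"])
print(out)
```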

Related

pandas: update dataframe values from a dataframe that is not in the same format

I have two dataframes. The second dataframe contains the values to be updated in the first dataframe. df1:
data=[[1,"potential"],[2,"lost"],[3,"at risk"],[4,"promising"]]
df=pd.DataFrame(data,columns=['id','class'])
id class
1 potential
2 lost
3 at risk
4 promising
df2:
data2=[[2,"new"],[4,"loyal"]]
df2=pd.DataFrame(data2,columns=['id','class'])
id class
2 new
4 loyal
expected output:
data3=[[1,"potential"],[2,"new"],[3,"at risk"],[4,"loyal"]]
df3=pd.DataFrame(data3,columns=['id','class'])
id class
1 potential
2 new
3 at risk
4 loyal
The code below seems to be working, but I believe there is a more effective solution.
final=df.append([df2])
final = final.drop_duplicates(subset='id', keep="last")
addition:
Is there a way for me to write the previous value in a new column?
like this:
id class prev_class modified date
1 potential nan nan
2 new lost 2022.xx.xx
3 at risk nan nan
4 loyal promising 2022.xx.xx
Your solution is good. Here is an alternative with concat, plus DataFrame.sort_values to restore the order by id:
df = (pd.concat([df, df2])
.drop_duplicates(subset='id', keep="last")
.sort_values('id', ignore_index=True))
print (df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
The solution changes if you also need the previous class values and today's date:
df3 = pd.concat([df, df2])
mask = df3['id'].duplicated(keep='last')
df31 = df3[mask]
df32 = df3[~mask]
df3 = (df32.merge(df31, on='id', how='left', suffixes=('','_prev'))
.sort_values('id', ignore_index=True))
df3.loc[df3['class_prev'].notna(), 'modified date'] = pd.to_datetime('now').normalize()
print (df3)
id class class_prev modified date
0 1 potential NaN NaT
1 2 new lost 2022-03-31
2 3 at risk NaN NaT
3 4 loyal promising 2022-03-31
We can use DataFrame.update
df = df.set_index('id')
df.update(df2.set_index('id'))
df = df.reset_index()
Result
print(df)
id class
0 1 potential
1 2 new
2 3 at risk
3 4 loyal
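A self-contained sketch of the update approach, using the example data from the question. DataFrame.update works in place, matches rows by index label, and only overwrites with non-NA values from the other frame; rows of df2 whose id is not present in df are ignored rather than appended.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, "potential"], [2, "lost"], [3, "at risk"], [4, "promising"]],
    columns=["id", "class"],
)
df2 = pd.DataFrame([[2, "new"], [4, "loyal"]], columns=["id", "class"])

# Align both frames on id, overwrite matching rows in place, then restore the column.
updated = df.set_index("id")
updated.update(df2.set_index("id"))
updated = updated.reset_index()
print(updated)
```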
You can operate along your ids by setting them as the index and using combine_first to perform the update. Assigning prev_class is then straightforward, because both frames are aligned on the index.
df = df.set_index('id')
df2 = df2.set_index('id')
out = (
df2.combine_first(df)
.assign(
prev_class=df["class"].reindex(df2.index),
modified=lambda d:
d["prev_class"].where(
d["prev_class"].isna(), pd.Timestamp.now()
)
)
)
print(out)
class prev_class modified
id
1 potential NaN NaN
2 new lost 2022-03-31 06:51:20.832668
3 at risk NaN NaN
4 loyal promising 2022-03-31 06:51:20.832668

Pandas merging dataframes and overwriting the data in the original df

I'm trying to merge two pandas dataframes but I can't figure out how to get the result I need. These are the example versions of dataframes I'm looking at:
df1 = pd.DataFrame([["09/10/2019",None],["10/10/2019",None], ["11/10/2019",6],
["12/10/2019",5], ["13/10/2019",3], ["14/10/2019",3],
["15/10/2019",5],
["16/10/2019",None]], columns = ['Date', 'A'])
df2 = pd.DataFrame([["10/10/2019",3], ["11/10/2019",5], ["12/10/2019",6],
["13/10/2019",1], ["14/10/2019",2], ["15/10/2019",4]],
columns = ['Date', 'A'])
I have checked the Pandas merging 101 but still can't find the way to do it correctly. Essentially what I need using the same graphics as in the guide is this:
i.e. I want to keep the data from df1 that falls outside the shared keys section, but within shared area I want df2 data from column 'A' to overwrite data from df1. I'm not even sure that merge is the right tool to use.
I've tried using df1 = pd.merge(df1, df2, how='right', on='Date') with different options, but in most cases it creates two separate columns - A_x and A_y in the output.
This is what I want to get as the end result:
Date A
0 09/10/2019 NaN
1 10/10/2019 3.0
2 11/10/2019 5.0
3 12/10/2019 6.0
4 13/10/2019 1.0
5 14/10/2019 2.0
6 15/10/2019 4.0
7 16/10/2019 NaN
Thanks in advance!
Here is a way using combine_first:
df2.set_index('Date').combine_first(df1.set_index('Date')).reset_index()
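Or reindex_like, which gives the same result here only because df1 has no values for the dates that are missing from df2: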
df2.set_index('Date').reindex_like(df1.set_index('Date')).reset_index()
Date A
0 09/10/2019 NaN
1 10/10/2019 3.0
2 11/10/2019 5.0
3 12/10/2019 6.0
4 13/10/2019 1.0
5 14/10/2019 2.0
6 15/10/2019 4.0
7 16/10/2019 NaN
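The difference between the two calls shows up as soon as df1 holds a real value on a date that df2 does not cover. The sketch below uses a shortened, made-up variant of the question's data (the value 9 on 16/10/2019 is an assumption added for illustration): combine_first keeps that value, while reindex_like drops it.

```python
import pandas as pd

# Hypothetical variant of the question's data: df1 has a value (9) outside df2's dates.
df1 = pd.DataFrame(
    [["09/10/2019", None], ["10/10/2019", None], ["11/10/2019", 6], ["16/10/2019", 9]],
    columns=["Date", "A"],
)
df2 = pd.DataFrame([["10/10/2019", 3], ["11/10/2019", 5]], columns=["Date", "A"])

left = df1.set_index("Date")
right = df2.set_index("Date")

# df2 wins wherever it has a value; df1 fills in everything else.
combined = right.combine_first(left).reset_index()

# Only reshapes df2 to df1's index; df1's own values are never consulted.
reindexed = right.reindex_like(left).reset_index()

print(combined)
print(reindexed)
```

The dates stay plain strings here, as in the question, so the sorted result order is lexicographic; parsing them with pd.to_datetime first would be safer for data that spans several months.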

Python: Splitting a column into two Columns based off its value

I am trying to get from my starting DataFrame to my desired results.
I am trying to do a groupby on two columns (Name, Month) and I have a column (Category) that has either the value 'Score1' or 'Score2'. I want to create two columns with the name of values from the Category column and set their values to a value determined from another column.
pd.crosstab([df.Name, df.Month], df.Category)
is the closest I've got to creating the desired dataframe; however, I can't figure out how to get the values from my "Value" column to populate the dataframe.
Results from crosstab
The Dataframe in code form
df = pd.DataFrame(columns=['Name', 'Month', 'Category', 'Value'])
df['Name'] = ['Jack','Jack','Sarah','Sarah','Zack']
df['Month'] = ['Jan.','Jan.','Feb.','Feb.','Feb.']
df['Category'] = ['Score1','Score2','Score1','Score2','Score1']
df['Value'] = [1,2,3,4,5]
Thanks!
You can use pivot_table:
df.pivot_table(index=['Name', 'Month'],values='Value', columns='Category').rename_axis(None, axis=1).reset_index()
Out[1]:
Name Month Score1 Score2
0 Jack Jan. 1.0 2.0
1 Sarah Feb. 3.0 4.0
2 Zack Feb. 5.0 NaN
One way is with groupby and unstack:
new_df = (df.groupby(['Name','Month','Category'])
['Value'].first().unstack().reset_index())
print(new_df)
Category Name Month Score1 Score2
0 Jack Jan. 1.0 2.0
1 Sarah Feb. 3.0 4.0
2 Zack Feb. 5.0 NaN
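One thing worth knowing about the pivot_table answer: its default aggregation is the mean, so duplicate (Name, Month, Category) rows would be averaged silently. A small sketch, using the dataframe from the question, that makes the aggregation explicit with aggfunc="first" (an illustrative choice, mirroring the groupby answer):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": ["Jack", "Jack", "Sarah", "Sarah", "Zack"],
        "Month": ["Jan.", "Jan.", "Feb.", "Feb.", "Feb."],
        "Category": ["Score1", "Score2", "Score1", "Score2", "Score1"],
        "Value": [1, 2, 3, 4, 5],
    }
)

wide = (
    df.pivot_table(
        index=["Name", "Month"],
        columns="Category",
        values="Value",
        aggfunc="first",  # take the first value per cell instead of the default mean
    )
    .rename_axis(None, axis=1)  # drop the leftover 'Category' label on the columns
    .reset_index()
)
print(wide)
```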

Merging Dataframe with Different Dates?

I want to merge a separate dataframe (df2) into the main dataframe (df1), but if, for a given row, the date in df1 does not exist in df2, then use the most recent earlier date in df2.
I tried to use pd.merge, but it would remove rows with unmatched dates, and only keep the rows that matched in both df's.
df1 = [['2007-01-01','A'],
['2007-01-02','B'],
['2007-01-03','C'],
['2007-01-04','B'],
['2007-01-06','C']]
df2 = [['2007-01-01','B',3],
['2007-01-02','A',4],
['2007-01-03','B',5],
['2007-01-06','C',3]]
df1 = pd.DataFrame(df1)
df2 = pd.DataFrame(df2)
df1[0] = pd.to_datetime(df1[0])
df2[0] = pd.to_datetime(df2[0])
Current df1 | pd.merge():
0 1 2
0 2007-01-06 C 3
It only matches rows whose date is exactly the same in both df's; it does not consider values from earlier dates.
Expected df1:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3
2 2007-01-03 C NaN
3 2007-01-04 B 3
4 2007-01-06 C 3
The NaNs appear because df2 has no data for that letter on or before that date. Row 1 takes the value from one day earlier, while row 4 takes the value from exactly the same day.
Check your output by using merge_asof:
pd.merge_asof(df1,df2,on=0,by=1,allow_exact_matches=True)
Out[15]:
0 1 2
0 2007-01-01 A NaN
1 2007-01-02 B 3.0
2 2007-01-03 C NaN
3 2007-01-04 B 5.0 # this should be 5, not 3: 2007-01-03 is the closest earlier 'B' date (df2 has two 'B' rows)
4 2007-01-06 C 3.0
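The same merge_asof call, sketched with named columns to make it easier to read. The names date, key and value are made up for illustration (the question's frames use the default 0, 1, 2). merge_asof requires both frames to be sorted by the "on" column, and its default direction is "backward", which picks the last row in df2 at or before each date in df1.

```python
import pandas as pd

# Column names here are illustrative; the question's frames use 0, 1 and 2.
df1 = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2007-01-01", "2007-01-02", "2007-01-03", "2007-01-04", "2007-01-06"]
        ),
        "key": ["A", "B", "C", "B", "C"],
    }
)
df2 = pd.DataFrame(
    {
        "date": pd.to_datetime(["2007-01-01", "2007-01-02", "2007-01-03", "2007-01-06"]),
        "key": ["B", "A", "B", "C"],
        "value": [3, 4, 5, 3],
    }
)

# Both inputs must be sorted by the 'on' column; 'by' restricts matches to the same key.
out = pd.merge_asof(
    df1.sort_values("date"),
    df2.sort_values("date"),
    on="date",
    by="key",
)
print(out)
```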
Using your merge code, which I assume you have since it's not present in your question, insert the argument how='left' or how='outer'.
It should look like this:
dfmerged = pd.merge(df1, df2, how='left', left_on=['Date'], right_on=['Date'])
You can then use slicing and renaming to keep the columns you wish.
dfmerged = dfmerged[['Date', 'Letters', 'Numbers']]
Note: I do not know your column names since you haven't shown any code. Substitute as necessary

pandas how to outer join without creating new columns

I have 2 pandas dataframes like this:
date value
20100101 100
20100102 150
date value
20100102 150.01
20100103 180
The expected output should be:
date value
20100101 100
20100102 150
20100103 180
The 2nd dataframe always contains newest value that I'd like to add into the 1st dataframe. However, the value on the same day may differ slightly between the two dataframes. I would like to ignore the same dates and focus on adding the new date and value into the 1st dataframe.
I've tried outer join in pandas, but it gives me two columns value_x and value_y because the value are not essentially the same on same dates. Any solution to this?
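I believe you need concat with drop_duplicates. Use keep='last' to prefer the values from the second dataframe, or keep='first' to keep the values from the first one (which matches the expected output):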
df = pd.concat([df1,df2]).drop_duplicates('date', keep='last')
print (df)
date value
0 20100101 100.00
0 20100102 150.01
1 20100103 180.00
df = pd.concat([df1,df2]).drop_duplicates('date', keep='first')
print (df)
date value
0 20100101 100.0
1 20100102 150.0
1 20100103 180.0
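An alternative sketch that states the intent more directly: filter df2 down to the dates df1 does not have yet, then append only those rows. The frames are rebuilt from the question's example, and ignore_index=True is added to avoid the repeated index labels visible in the outputs above.

```python
import pandas as pd

df1 = pd.DataFrame({"date": [20100101, 20100102], "value": [100, 150]})
df2 = pd.DataFrame({"date": [20100102, 20100103], "value": [150.01, 180]})

# Keep only the rows of df2 whose date is not already present in df1.
new_rows = df2[~df2["date"].isin(df1["date"])]

# Append them and renumber the index from 0.
out = pd.concat([df1, new_rows], ignore_index=True)
print(out)
```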
