Pandas - Replace row values based on multi-column match [duplicate] - python

This question already has answers here: Pandas Merging 101.
This is a simple question, but most of the solutions I found here were based on a single-column match (mainly only an ID).
Df1
Name  Dept  Amount  Leave
ABC   1     10      0
BCD   1     5       0
Df2
Alias_Name  Dept  Amount  Leave  Address  Join_Date
ABC         1     100     5      qwerty   date1
PQR         2     0       2      asdfg    date2
I want to replace row values in df1 when both the Name and Dept match.
I tried merge(left_on=['Name', 'Dept'], right_on=['Alias_Name', 'Dept'], how='left'), but it gives me double the number of columns, with _x and _y suffixes. I just need to replace the Dept, Amount and Leave in df1 if the Name and Dept match any row in df2.
Desired Output:
Name Dept Amount Leave
ABC 1 100 5
BCD 1 5 0

# left-merge on Name/Dept (after renaming Alias_Name to Name),
# then fill the unmatched rows from the original df1 values
new_df = (df1[['Name', 'Dept']]
          .merge(df2[['Alias_Name', 'Dept', 'Amount', 'Leave']]
                    .rename(columns={'Alias_Name': 'Name'}),
                 how='left')
          .fillna(df1[['Amount', 'Leave']]))
Result:
Name Dept Amount Leave
0 ABC 1 100.0 5.0
1 BCD 1 5.0 0.0
You can use new_df[['Amount', 'Leave']] = new_df[['Amount', 'Leave']].astype(int) to re-cast the dtype if that's important.

You can create a temporary column in both data frames that combines "Name" and "Dept" (e.g. by concatenating them as strings). That column can then be used as the key to match on.
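A minimal sketch of that idea, assuming Dept is numeric (hence the str cast) and the composite key is unique in both frames:
import pandas as pd

# build a composite key from the name and department columns
df1['key'] = df1['Name'] + '_' + df1['Dept'].astype(str)
df2['key'] = df2['Alias_Name'] + '_' + df2['Dept'].astype(str)

# overwrite Amount/Leave in df1 with values from the matching rows of df2
df1 = df1.set_index('key')
df1.update(df2.set_index('key')[['Amount', 'Leave']])
df1 = df1.reset_index(drop=True)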

Try:
# select the rows that should be replaced, keeping df1's original index
replace_df = (df1.reset_index()[['index', 'Name', 'Dept']]
                 .merge(df2, left_on=['Name', 'Dept'],
                        right_on=['Alias_Name', 'Dept'], how='inner')
                 .set_index('index'))
# write the matched Amount/Leave values back into df1
df1.loc[replace_df.index, ['Amount', 'Leave']] = replace_df[['Amount', 'Leave']]
Result:
Name Dept Amount Leave
0 ABC 1 100 5
1 BCD 1 5 0

Related

Combining two pandas dataframes into one based on conditions

I have two dataframes; simplified, they look like this:
Dataframe A
ID  item
1   apple
2   peach
Dataframe B
ID  flag  price ($)
1   A     3
1   B     2
2   B     4
2   A     2
ID: unique identifier for each item
flag: unique identifier for each vendor
price: varies for each vendor
In this simplified case I want to extract the price values from dataframe B and add them to dataframe A in separate columns, depending on their flag value.
The result should look similar to this:
Dataframe C
ID  item   price_A  price_B
1   apple  3        2
2   peach  2        4
I tried to split dataframe B into two dataframes by flag value and merge them afterwards with dataframe A, but there must be an easier solution.
Thank you in advance! :)
You can use pd.merge and pd.pivot_table for this (note that the price column is named 'price ($)' in the sample data):
df_C = pd.merge(df_A, df_B, on=['ID']).pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
df_C.columns = ['price_' + alpha for alpha in df_C.columns]
df_C = df_C.reset_index()
Output:
>>> df_C
ID item price_A price_B
0 1 apple 3 2
1 2 peach 2 4
Alternatively, as a single method chain:
(dfb
.merge(dfa, on="ID")
.pivot_table(index=['ID', 'item'], columns='flag', values='price ($)')
.add_prefix("price_")
.reset_index()
)

Adding rows with value "0" for missing rows in python [duplicate]

This question already has answers here: How can I fill in a missing values in range with Pandas?
I want to add missing rows based on the column "id" in a dataframe. The ids should be continuous integers, running from 1 to 60000. A small example follows, where id ranges from 1 to 5, so I need to add rows for ids 1, 3 and 4 with "0" values to the table below.
id  value1  value2
2   13      33
5   45      24
The final dataframe would become:
id  value1  value2
1   0       0
2   13      33
3   0       0
4   0       0
5   45      24
You can set column 'id' as the index, then use the reindex method to conform df to a new index running from 1 to 5. reindex places NaN in locations that had no value in the previous index, so you fill those with fillna(0), reset the index, and finally cast df back to int dtype:
df = df.set_index('id').reindex(range(1, 6)).fillna(0).reset_index().astype(int)
For the full data described above, the same pattern with range(1, 60001) covers ids 1 to 60000.
Output:
id value1 value2
0 1 0 0
1 2 13 33
2 3 0 0
3 4 0 0
4 5 45 24
You may want to look at the DataFrame.append method: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
It adds rows to a DataFrame. (Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; a pd.concat sketch follows below.)
You could use something like the following:
for i in [1, 3, 4]:
    df = df.append({'id': i, 'value1': 0, 'value2': 0}, ignore_index=True)
If you want them to be in order by id afterwards, you could sort it:
df.sort_values(by=['id'], inplace=True)
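Since DataFrame.append is gone on pandas 2.0+, here is a minimal sketch of the same idea using pd.concat; the missing ids are hard-coded, as in the loop above:
import pandas as pd

# rows to add, with 0 in every value column
missing = pd.DataFrame({'id': [1, 3, 4], 'value1': 0, 'value2': 0})
# append them and sort so the ids end up in order
df = pd.concat([df, missing], ignore_index=True).sort_values('id', ignore_index=True)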

Concatenating data from two files

There are 2 files opened with Pandas. If there are common parts in the first column of the two files (highlighted with colored letters in the original screenshots), I want to paste the data from the second column of the second file into the matched rows of the first file, and write 'NaN' where there is no match. Is there a way to do this?
File1
0 1
0 JCW 574
1 MBM 4212
2 COP 7424
3 KVI 4242
4 ECX 424
File2
0 1
0 G=COP d4ssd5vwe2e2
1 G=DDD dfd23e1rv515j5o
2 G=FEW cwdsuve615cdldl
3 G=JCW io55i5i55j8rrrg5f3r
4 G=RRR c84sdw5e5vwldk455
5 G=ECX j4ut84mnh54t65y
File1 (desired output):
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Use Series.str.extract to build a new Series of values matched against df1[0] first, and then merge with a left join using DataFrame.merge:
import numpy as np
import pandas as pd

# header=None is assumed here so the columns are the integers 0 and 1,
# matching the frames shown above
df1 = pd.read_csv(file1, header=None)
df2 = pd.read_csv(file2, header=None)

s = df2[0].str.extract(f'({"|".join(df1[0])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print(df)
0 1 2
0 JCW 574 io55i5i55j8rrrg5f3r
1 MBM 4212 NaN
2 COP 7424 d4ssd5vwe2e2
3 KVI 4242 NaN
4 ECX 424 j4ut84mnh54t65y
Or, if you need to match on the last 3 characters of column df1[0], use:
s = df2[0].str.extract(f'({"|".join(df1[0].str[-3:])})', expand=False)
df = df1.merge(df2[[1]], how='left', left_on=0, right_on=s)
df.columns = np.arange(len(df.columns))
print(df)
Have a look at pandas' concat function with join='outer' (https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html). There is also an existing question, and its answer, that can help you.
It involves reindexing each of your data frames to use the column that is now called "0" as the index, and then joining the two data frames on their indices.
Also, may I suggest that you do not paste images of your dataframes, but instead upload the data in a form that other people can use to test their suggestions.
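A minimal sketch of that join-on-index idea, assuming whitespace-separated files with no header row (the file names are placeholders):
import pandas as pd

df1 = pd.read_csv('file1.txt', sep=r'\s+', header=None)
df2 = pd.read_csv('file2.txt', sep=r'\s+', header=None)

# strip the 'G=' prefix so the keys in both frames line up
df2[0] = df2[0].str.replace('G=', '', regex=False)
# rename df2's data column so it does not clash with df1's column 1
df2 = df2.rename(columns={1: 2})
# join on the index; df1 rows without a match get NaN automatically
out = df1.set_index(0).join(df2.set_index(0)).reset_index()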

Returning the rows based on specific value without column name

I know how to return the rows containing a specific text value by specifying the column name, like below.
import pandas as pd

data = {'id': ['1', '2', '3', '4'],
        'City1': ['abc', 'def', 'abc', 'khj'],
        'City2': ['JRR', 'ytu', 'rr', 'abc']}
df = pd.DataFrame(data)
df.loc[df['City1'] == 'abc']
and the output is:
id City1 City2
0 1 abc JRR
2 3 abc rr
but what I need is different: my specific value 'abc' can be in any column, and I need to return the rows that contain the specific text, e.g. 'abc', without giving a column name. Is there any way? I need output as below:
id City1 City2
0 1 abc JRR
1 3 abc rr
2 4 khj abc
You can use any with axis=1 to check all columns row-wise and get the expected result:
>>> df[(df == 'abc').any(axis=1)]
id City1 City2
0 1 abc JRR
2 3 abc rr
3 4 khj abc

Pandas how to aggregate more than one column

Here is the snippet:
test = pd.DataFrame({'userid': [1,1,1,2,2], 'order_id': [1,2,3,4,5], 'fee': [2,1,5,3,1]})
I'd like to group based on userid and count the 'order_id' column and sum the 'fee' column:
test.groupby('userid').order_id.count()
test.groupby('userid').fee.sum()
Is it possible to perform these two operations in one line of code, so that I can get a resulting df that looks like this:
userid counts sum
...
I've tried pivot_table:
test.pivot_table(index='userid', values=['order_id', 'fee'], aggfunc=[np.size, np.sum])
It gives something like this:
size sum
fee order_id fee order_id
userid
1 3 3 8 6
2 2 2 4 9
Is it possible to tell pandas to use np.size & np.sum on one column but not both?
Use DataFrameGroupBy.agg and rename the columns:
d = {'order_id': 'counts', 'fee': 'sum'}
df = (test.groupby('userid')
          .agg({'order_id': 'count', 'fee': 'sum'})
          .rename(columns=d)
          .reset_index())
print(df)
   userid  counts  sum
0       1       3    8
1       2       2    4
But it is better to aggregate with size, because count is only needed if you want to exclude NaNs:
df = (test.groupby('userid')
          .agg({'order_id': 'size', 'fee': 'sum'})
          .rename(columns=d)
          .reset_index())
print(df)
   userid  counts  sum
0       1       3    8
1       2       2    4
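On pandas 0.25+, named aggregation achieves the size/sum and the renaming in a single step:
df = (test.groupby('userid')
          .agg(counts=('order_id', 'size'), sum=('fee', 'sum'))
          .reset_index())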
