Appending two dataframes with multiindex - python

I have two dataframes, each with a multiindex. The multiindex levels share names, but are in a different order. When I append or concat, I would expect pandas to line up the indices just like it aligns index-less columns before appending. Is there a function or an argument I can pass to append or concat to get this to work in the way I desire (and that I think ought to be standard)?
import pandas as pd
df1 = pd.DataFrame(data = {'Name':['Bob','Ann','Sally'], 'Acct':['Savings','Savings','Checking'], 'Value':[101,102,103]})
df1 = df1.set_index(['Name','Acct'])
print(df1)
df2 = pd.DataFrame(data = {'Acct':['Savings','Savings','Checking'], 'Name':['Bob','Ann','Sally'], 'Value':[201,202,203]})
df2 = df2.set_index(['Acct','Name'])
print(df2)
print(df1.append(df2))
print(pd.concat([df1,df2]))
                Value
Name  Acct
Bob   Savings     101
Ann   Savings     102
Sally Checking    103

                 Value
Acct     Name
Savings  Bob      201
         Ann      202
Checking Sally    203

                 Value
Name     Acct
Bob      Savings     101
Ann      Savings     102
Sally    Checking    103
Savings  Bob         201
         Ann         202
Checking Sally       203

                 Value
Name     Acct
Bob      Savings     101
Ann      Savings     102
Sally    Checking    103
Savings  Bob         201
         Ann         202
Checking Sally       203
As you can see, after appending or concatenating, my combined index appears to show that, for example, "Sally" is an account, not a name. I'm aware that if I put the index levels in the same order when setting index, I'll get what I want, and that I could reset the index on the frames to align them, but I'm hoping there's a more intuitive way to get the indices to align on name, not on position.

Somewhat of a work around, you can reset_index on both data sets, concat them, then set_index:
print(pd.concat([
    df1.reset_index(),
    df2.reset_index()
], sort=False).set_index([
    'Name',
    'Acct'
]))
                Value
Name  Acct
Bob   Savings     101
Ann   Savings     102
Sally Checking    103
Bob   Savings     201
Ann   Savings     202
Sally Checking    203
Though I'm not sure why you would want to have multiple rows with the same index...
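
Another option, sketched here on the assumption that both frames use the same level names (just in a different order), is to reorder one frame's index levels to match the other before concatenating, for example with reorder_levels (swaplevel would work just as well for a two-level index):

import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Ann', 'Sally'],
                    'Acct': ['Savings', 'Savings', 'Checking'],
                    'Value': [101, 102, 103]}).set_index(['Name', 'Acct'])
df2 = pd.DataFrame({'Acct': ['Savings', 'Savings', 'Checking'],
                    'Name': ['Bob', 'Ann', 'Sally'],
                    'Value': [201, 202, 203]}).set_index(['Acct', 'Name'])

# Reorder df2's index levels by name to match df1, then concatenate.
combined = pd.concat([df1, df2.reorder_levels(df1.index.names)])
print(combined)

This keeps both MultiIndexes intact and aligns the levels by name rather than by position.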

Related

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (like the one in Excel) that adds up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There are multiple ways of solving this. One option is to use pandas concat to stack the required columns and then use groupby:
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
    df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3': 'Salesperson', 'Amount3': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
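
If there are many Salesperson/Amount pairs, here is a sketch of the same concat idea that builds the list of column pairs programmatically (the pair list below is assumed from the edited example):

import pandas as pd

d = {'Client': [1, 2, 3, 4],
     'Salesperson': ['John', 'John', 'Bob', 'Richard'],
     'Amount': [1000, 1000, 0, 500],
     'Salesperson 2': ['Bob', 'Richard', 'John', 'Tom'],
     'Amount2': [400, 200, 300, 500],
     'Salesperson 3': ['Nick', 'Richard', 'Sam', 'Bob'],
     'Amount3': [400, 800, 100, 400]}
df = pd.DataFrame(data=d)

# Pair each Salesperson* column with its Amount* column, rename each pair
# to a common set of names, and stack the pieces before grouping.
pairs = [('Salesperson', 'Amount'),
         ('Salesperson 2', 'Amount2'),
         ('Salesperson 3', 'Amount3')]
pieces = [df[[sp, amt]].rename(columns={sp: 'Salesperson', amt: 'Amount'})
          for sp, amt in pairs]
merged_df = pd.concat(pieces)
print(merged_df.groupby('Salesperson', as_index=False)['Amount'].sum())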
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson': 'Salesperson 1', 'Amount': 'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'],
                              i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the Salespersons are in one column and the amounts are in another; then you can groupby and aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the parts of the column names that match it as headers. The column names follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern; .value is paired with the group in brackets, and whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.

Efficiently creating frequency and recency columns

This is a very specific problem: my code is very slow, and I wonder if I'm doing something obviously wrong or if there's a better way.
The situation: I have two dataframes, frame and contacts. frame is a database of people, and contacts is points of contact with these people. They look something like:
frame:

        name
id
166     Bob
253     Serge
1623    Anna
766     Benna
981     Paul

contacts:

     id   type        date
0   253  email  2016-01-05
1  1623   sale  2012-05-12
2  1623  email  2017-12-22
3   253   sale  2018-02-15
I want to add two columns to frame, 'most_recent' and '3 year contact count', which give the most recent contact (if there is one) and the number of contacts in the past 3 years.
(frame is ~100,000 rows, and contacts is ~95,000)
So far, I'm reducing the number of ids to iterate over, then building a dict with the right values for each id:
from datetime import datetime as dt

id_list = [i for i in frame.index if i in contacts['id'].values]
freq_rec_dict = {i: [contacts.loc[contacts['id'] == i, 'date'].max(),
                     len(contacts.loc[(contacts['id'] == i) & (contacts['date'] > dt(2016, 1, 1))])]
                 for i in id_list}
Then, I turn the dict into a dataframe and perform a join:
freq_rec_df = pd.DataFrame.from_dict(freq_rec_dict, orient='index',columns=['most_recent','3 year contact count'])
result = frame.join(freq_rec_df)
This does give me what I need, but the dictionary comprehension took 30 minutes - I feel like there must be a more efficient way to do this (I will need this in the future). Any ideas would be much appreciated - thanks!
You don't specify your desired output, but here goes. You should leverage the built-in groupby method instead of taking your data out of a frame, putting it back into a frame, and then merging:
contacts.groupby('id')[['date','type']].max()
date type
id
253 2018-02-15 sale
1623 2017-12-22 sale
You can do this in one line if you need to save memory. Again, you don't give a preferred output, so I used a left join; you could also use 'inner' to keep only the rows where contact records exist.
df = pd.merge(frame, contacts.groupby('id')[['date', 'type']].max(),
              left_index=True, right_index=True, how='left')
name date type
id
166 Bob NaN NaN
253 Serge 2018-02-15 sale
1623 Anna 2017-12-22 sale
766 Benna NaN NaN
981 Paul NaN NaN
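
For the two requested columns specifically, here is a minimal sketch, assuming date has been parsed to datetimes and using an illustrative cutoff in place of "the past 3 years"; one groupby with named aggregation produces both values, and the result joins back onto frame on its id index (the column names most_recent and recent_count are illustrative):

import pandas as pd

frame = pd.DataFrame({'name': ['Bob', 'Serge', 'Anna', 'Benna', 'Paul']},
                     index=pd.Index([166, 253, 1623, 766, 981], name='id'))
contacts = pd.DataFrame({'id': [253, 1623, 1623, 253],
                         'type': ['email', 'sale', 'email', 'sale'],
                         'date': pd.to_datetime(['2016-01-05', '2012-05-12',
                                                 '2017-12-22', '2018-02-15'])})

cutoff = pd.Timestamp('2016-01-01')  # stand-in for "3 years ago"
summary = contacts.groupby('id')['date'].agg(
    most_recent='max',
    recent_count=lambda s: (s > cutoff).sum(),
)
result = frame.join(summary, how='left')
print(result)

Because everything stays vectorized inside a single groupby, this avoids the per-id filtering that made the dictionary comprehension slow.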

Pandas - Filter rows where columns match at least once

I have a pandas dataframe as shown below:
Name ID1 ID2
Joe 248 248
Joe 248 326
Joe 721 248
Anna 295 295
Bob 721 248
Bob 721 326
Bob 248 566
I need to keep only the rows where ID1 and ID2 do not match, with the added condition that if the IDs match at least once for a Name, all rows for that Name should be dropped.
For example:
For Name = Joe, IDs match once (248), so remove all rows with Joe.
For Name = Bob, IDs never match, so keep all rows with Bob.
So far, I've tried dropping duplicates by sorting names and checking whether the IDs match, but this does not take into account IDs matching at least once:
df = df.sort_values(['Name']).drop_duplicates(['Name'],keep='first')
I'm not sure whether pandas can drop duplicates with a condition that something matches 'at least once'.
If I understand correctly, you can calculate the names to remove and then use Boolean indexing:
names_to_remove = df.loc[df['ID1'] == df['ID2'], 'Name'].values
res = df[~df['Name'].isin(names_to_remove)]
print(res)
Name ID1 ID2
4 Bob 721 248
5 Bob 721 326
6 Bob 248 566
df.groupby('Name').apply(
    lambda grp: grp if not (grp['ID1'] == grp['ID2']).any() else None
).dropna()
Explanation: Group by Name; if there is no row in the group where ID1 and ID2 match, return the group, otherwise return None, and then drop the null rows.
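
A more direct variant of the same grouping idea, sketched with the question's data, is groupby().filter, which keeps only the groups where the IDs never match:

import pandas as pd

df = pd.DataFrame({'Name': ['Joe', 'Joe', 'Joe', 'Anna', 'Bob', 'Bob', 'Bob'],
                   'ID1': [248, 248, 721, 295, 721, 721, 248],
                   'ID2': [248, 326, 248, 295, 248, 326, 566]})

# Keep a Name's rows only if ID1 never equals ID2 within that group.
res = df.groupby('Name').filter(lambda g: not (g['ID1'] == g['ID2']).any())
print(res)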

Comparing columns from two data frames

I am relatively new to Python. I have the following two dataframes, let's say df1 and df2 respectively.
df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support

df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if a person's name in df2 does not exist in df1, that row of df2 is output to another dataframe. So for the example above, the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column of df1.
You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
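As a runnable sketch with the question's data:

import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

# Keep only the df2 rows whose Name does not appear in df1.
res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)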

How to remove rows from a dataframe based on another

I have been trying my level best to compare two data frames in a specific manner, but have not been successful. I hope the experts here can help with a solution.
Below is my problem description:
I have two dataframes.
Data frame #1 looks like this.
df1:
pid name age
121 John 36
132 Mary 26
132 Jim 46
145 Kim 50
Dataframe #2 looks like this:
df2:
pid name age
121 John 32
132 Tom 28
132 Susan 40
155 Kim 50
I want to compare both dataframes in such a way that rows in df2 whose pid does not appear in df1 are deleted.
My new dataframe #2 should look like this:
df2:
pid name age
121 John 32
132 Tom 28
132 Susan 40
Highly appreciate your help on this.
You could use isin as in
df2[df2.pid.isin(df1.pid)]
which will return only the rows of df2 whose pid is in df1.
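A runnable sketch with the question's data:

import pandas as pd

df1 = pd.DataFrame({'pid': [121, 132, 132, 145],
                    'name': ['John', 'Mary', 'Jim', 'Kim'],
                    'age': [36, 26, 46, 50]})
df2 = pd.DataFrame({'pid': [121, 132, 132, 155],
                    'name': ['John', 'Tom', 'Susan', 'Kim'],
                    'age': [32, 28, 40, 50]})

# Keep only the df2 rows whose pid also appears in df1.
df2 = df2[df2.pid.isin(df1.pid)]
print(df2)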
