Comparing columns from two data frames - python

I am relatively new to Python. Say I have the following two dataframes, df1 and df2 respectively:
df1:
Id  Name  Job
1   Jim   Tester
2   Bob   Developer
3   Sam   Support

df2:
Name  Salary  Location
Jim   100     Japan
Bob   200     US
Si    300     UK
Sue   400     France
I want to compare the 'Name' column in df2 to df1 such that if the name of the person (in df2) does not exist in df1, then that row in df2 would be output to another dataframe. So for the example above the output would be:
Name Salary Location
Si 300 UK
Sue 400 France
Si and Sue are output because they do not exist in the 'Name' column of df1.

You can use Boolean indexing:
res = df2[~df2['Name'].isin(df1['Name'].unique())]
We use hashing via pd.Series.unique as an optimization in case you have duplicate names in df1.
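A minimal sketch reproducing the example above (values copied from the question) to show what res contains:
import pandas as pd

df1 = pd.DataFrame({'Id': [1, 2, 3],
                    'Name': ['Jim', 'Bob', 'Sam'],
                    'Job': ['Tester', 'Developer', 'Support']})
df2 = pd.DataFrame({'Name': ['Jim', 'Bob', 'Si', 'Sue'],
                    'Salary': [100, 200, 300, 400],
                    'Location': ['Japan', 'US', 'UK', 'France']})

# keep only the rows of df2 whose Name never appears in df1
res = df2[~df2['Name'].isin(df1['Name'].unique())]
print(res)
#   Name  Salary Location
# 2   Si     300       UK
# 3  Sue     400   France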

Related

Add values in columns if criteria from another column is met

I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (like the one from Excel) that will add up the amount each salesperson is due. I don't know how to iterate over each row, but I want it to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the total amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There can be multiple ways of solving this. One option is to use pd.concat to stack the required column pairs and then groupby:
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson', 'Amount2': 'Amount'}),
    df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3': 'Salesperson', 'Amount3': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'], i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you the required output:
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
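As a side note, pandas also ships a lesser-known lreshape helper that does the same stacking without renaming columns by hand; a rough sketch using the original column names from the question (i.e. before the rename in Edit 2):
long_df = pd.lreshape(df, {
    'Salesperson': ['Salesperson', 'Salesperson 2'],
    'Amount': ['Amount', 'Amount2'],
})
long_df.groupby('Salesperson', as_index=False)['Amount'].sum()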
One option is to tidy the dataframe into long form, where all the Salespersons are in one column, and the amounts are in another, then you can groupby and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the parts of the column names that match it as the new headers. The columns follow a pattern: they start with text (either Salesperson or Amount) and may or may not end in a number. This pattern is captured in names_pattern. .value is paired with the regex group in the brackets; whatever falls outside the group does not matter in this case.
Once transformed into long form, it is easy to groupby and aggregate. The as_index parameter allows us to keep the output as a dataframe.

Reference df1 to check validity in df2 and create a new column in pandas

I have a dataframe with three columns denoting three zones of countries a user can be subscribed to. In each of the three columns there is a list of countries (some countries appear in all three columns).
In another dataframe I have a list of users and the countries they are in.
The objective is to identify which zone, if any, the user is in, and to mark whether they are allowed to use the service in that country.
df1 contains the country the user is in and the zone the user is subscribed to, as well as other fields.
df2 contains the zones available and the list of countries for that zone, as well as other fields.
df1.head()
name alias3 status_y country
Thetis Z1 active Romania
Demis Z1 active No_country
Donis Z1 active Sweden
Rhona Z3 active Germany
Theau Z2 active Bangladesh
df2.head()
Zone 1 Zone 2 Zone 3
ALBANIA ALBANIA ALBANIA
BELGIUM BELGIUM BELGIUM
BULGARIA AUSTRIA AUSTRIA
NaN CROATIA CROATIA
NaN NaN DENMARK
I have written conditions listing which of the three zones the user is subscribed to.
I have written values that select the country the user is in and check whether that country is in the zone the user is subscribed to.
conditions = [
    (df1['alias3'] == 'Z1'),
    (df1['alias3'] == 'Z2'),
    (df1['alias3'] == 'Z3')
]
values = [
    df1['country'].str.upper().isin(country_zone['Zone 1']),
    df1['country'].str.upper().isin(country_zone['Zone 2']),
    df1['country'].str.upper().isin(country_zone['Zone 3'])
]
df1['valid_country'] = np.select(conditions, values)
Is there a better way to do this in pandas?
One easy way would be:
def valid(sdf):
    zone = sdf.alias3.iat[0][-1]
    sdf["valid_country"] = sdf.country.str.upper().isin(df2[f"Zone {zone}"])
    return sdf
df1 = df1.groupby("alias3").apply(valid)
groupby df1 over the alias3 values, then apply a function to each group that checks whether the uppercased country names in the group's country column are in the respective column of df2, and stores the result in a column named valid_country.
Another way would be to alter df2 a bit:
df2.columns = df2.columns.str.replace("one ", "")
df2 = (
    df2.melt(var_name="alias3", value_name="country")
    .dropna()
    .assign(valid_country=True)
)
df2.country = df2.country.str.capitalize()
Transforming the column names from 'Zone 1/2/3' to 'Z1/2/3'
melt-ing the Zone column names into a column named alias3, with the respective country names in a column named country
Dropping the NaNs
Adding a column named valid_country that is all True
Capitalizing the country names
And then:
df1 = df1.merge(df2, on=["alias3", "country"], how="left")
df1["valid_country"] = df1["valid_country"].fillna(False)
Left-merge-ing the result with df1 on the columns alias3 and country
Filling in False for the missing values in the column valid_country
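Yet another option, sketched here against the original wide df2 from the question (i.e. before the melt above): build one set of valid countries per zone up front, then test each row against the set for its own zone.
# assumes df2 still has the columns 'Zone 1', 'Zone 2', 'Zone 3'
zone_sets = {f"Z{i}": set(df2[f"Zone {i}"].dropna()) for i in (1, 2, 3)}
df1["valid_country"] = [
    country.upper() in zone_sets.get(zone, set())
    for zone, country in zip(df1["alias3"], df1["country"])
]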

Merging/concatenating two datasets on a specific column (different lengths) [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have two different datasets
df1 (1200 rows):
Name Surname Age Address
Julian Ross 34 Main Street
Mary Jane 52 Cook Road
df2 (800 rows):
Name Country Telephone
Julian US NA
df1 contains the full list of unique names; df2 contains fewer rows, as many names were not added.
I would like to get a final dataset with the full list of names in df1 (and all the fields that are there) plus the fields in df2. I would then expect a final dataset of length 1200, with some empty fields corresponding to the names missing from df2.
I have tried as follows:
pd.concat([df1.set_index('Name'),df2.set_index('Name')], axis=1, join='inner')
but it returns the length of the smallest dataset (i.e. 800).
I have also tried
df1.merge(df2, how = 'inner', on = ['Name'])
... same result.
I am not totally familiar with joining/merging/concatenating functions, even after reading the document https://pandas.pydata.org/docs/user_guide/merging.html .
I know that probably this question will be a duplicate of some others and I will be happy to delete it if necessary, but I would be really grateful if you could provide some help and explain how to get the expected result:
df
Name Surname Age Address Country Telephone
Julian Ross 34 Main Street US NA
Mary Jane 52 Cook Road
IIUC, use pd.merge like below:
>>> df1.merge(df2, how='left', on='Name')
Name Surname Age Address Country Telephone
0 Julian Ross 34 Main Street US NaN
1 Mary Jane 52 Cook Road NaN NaN
If you want to keep the number of rows of df1, you have to use how='left'; this preserves the length as long as there are no duplicate names in df2.
Read Pandas Merging 101
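If you are worried that df2 might contain duplicate names and silently multiply rows, merge's validate argument can assert the expected relationship; a small sketch (column name taken from the question):
# raises pandas.errors.MergeError if 'Name' is not unique on both sides
out = df1.merge(df2, how='left', on='Name', validate='one_to_one')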

Is there a way to group a dataframe by certain elements of its content without a given delimiter?

I'm facing the following situation. I have a dataframe which looks like this (due to the sensitivity of the data I have to paraphrase it):
Column A Column B
A1B12C123 Japan
A2B34C456 Switzerland
A3B45C789 Japan
A1B15C729 Japan
My goal is to group Column A by the recurring pattern, which describes a certain property.
Meaning: Group by A1, Group by B12, Group by C123.
In order to do that, I split the Column and created new ones for each level of hierarchy, e.g.:
Column A Column B Column C
A1 B12 C123
A2 B34 C456
A3 B45 C789
A1 B15 C729
Those columns I have to add to my existing dataframe, and then I'll be able to group the way I wanted to.
I think this can work, but it seems a bit tedious and inelegant.
Is there a possibility or a way in Pandas to do this more elegantly?
I'd be happy about any input on that matter.
Best regards
Taking Seyi Daniel's idea from the comments, you can use the extractall() string method on Column A to explode it based on regex groups and join Column B onto it.
import pandas as pd
from io import StringIO
data = StringIO("""
Column_A Column_B
A1B12C123 Japan
A2B34C456 Switzerland
A3B45C789 Japan
A1B15C729 Japan
""")
df = pd.read_csv(data, delim_whitespace=True)
regex_df = df["Column_A"].str.extractall(r"(A\d*)|(B\d*)|(C\d*)")
# drop extra levels
regex_s = regex_df.stack().reset_index((1,2), drop=True)
# give the new column a name
regex_s.name = "group"
# add column B
result = pd.merge(regex_s, df["Column_B"], left_index=True, right_index=True)
print(result)
group Column_B
0 A1 Japan
0 B12 Japan
0 C123 Japan
1 A2 Switzerland
1 B34 Switzerland
1 C456 Switzerland
2 A3 Japan
2 B45 Japan
2 C789 Japan
3 A1 Japan
3 B15 Japan
3 C729 Japan
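If you would rather keep the three parts as separate columns (as the question originally described) instead of exploding to long form, a sketch with str.extract and named capture groups (regex assumed from the sample values):
parts = df["Column_A"].str.extract(r"(?P<A>A\d+)(?P<B>B\d+)(?P<C>C\d+)")
wide = pd.concat([parts, df["Column_B"]], axis=1)
# wide now has columns A, B, C and Column_B, so e.g. wide.groupby("A").size() works directly
print(wide)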

Filter rows based on multiple criteria

I have the following dataframe:
name date_one date_two
-----------------------------------------
sue
sue
john
john 13-06-2019
sally 23-04-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 14-06-2019
bob 18-05-2019 17-06-2019
The data contains duplicate name rows. I need to filter the data based on the following (in this order of priority):
1. For each name, keep the row with the newest date_two. If the name doesn't have any rows with a value for date_two, go to step 2.
2. For each name, keep the row with the newest date_one. If the name doesn't have any rows with a value for date_one, go to step 3.
3. These names don't have any rows with a date_one or date_two, so just keep the first row for that name.
The above dataframe would be filtered to:
name date_one date_two
-----------------------------------------
sue
john 13-06-2019
sally 23-04-2019 25-04-2019
bob 18-05-2019 17-06-2019
This doesn't need to be done in the most performant way. The dataframe is only a few thousand rows and only needs to be done once. If it needs to be done in multiple (slow) steps that's fine.
Use DataFrameGroupBy.idxmax per group to pick the rows with the maximal dates, then filter out already matched names with Series.isin, and finally join everything back together with concat:
df['date_one'] = pd.to_datetime(df['date_one'], dayfirst=True)
df['date_two'] = pd.to_datetime(df['date_two'], dayfirst=True)
#rule1
df1 = df.loc[df.groupby('name')['date_two'].idxmax().dropna()]
#rule2
df2 = df.loc[df.groupby('name')['date_one'].idxmax().dropna()]
df2 = df2[~df2['name'].isin(df1['name'])]
#rule3
df3 = df[~df['name'].isin(pd.concat([df1['name'], df2['name']]))].drop_duplicates('name')
df = pd.concat([df1, df2, df3]).sort_index()
print (df)
name date_one date_two
0 sue NaT NaT
3 john 2019-06-13 NaT
5 sally 2019-04-23 2019-04-25
7 bob 2019-05-18 2019-06-17
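A shorter alternative that should reproduce the same result on this data is to sort by date_two and then date_one in descending order (missing dates last) and keep the first row per name; a sketch, assuming it starts from the original frame with the dates already parsed as datetimes:
out = (df.sort_values(['date_two', 'date_one'],
                      ascending=False, na_position='last', kind='stable')
         .drop_duplicates('name')  # keep the best-ranked row per name
         .sort_index())            # restore original row order
print(out)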
