I have two dataframes as below
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
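If the clipboard isn't available, the same two frames can be built reproducibly from the CSV text above with io.StringIO instead of read_clipboard (a minimal sketch; the column names follow the data shown):

```python
import io
import pandas as pd

csv1 = """ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND"""

csv2 = """ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26"""

# StringIO makes the CSV text readable by read_csv without touching the clipboard
df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))
```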
I would like to do the below
a) Check whether the ID and Name from df1 is present in df2.
b) If present in df2, put Yes in a Status column, otherwise No. Don't use the ~ or not in operator, because my df2 has millions of rows and negating a filter on it returns irrelevant results
I tried the below
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','class']).size().reset_index()
The above code gives the matching IDs and names between df1 and df2.
But I want the IDs and names that are present in df1 but missing from df2. I cannot use the ~ operator, because it would return all the rows from df2 that don't have a match in df1, and in the real data df2 has millions of rows. I only want to flag the missing df1 IDs and names in a Status column
I expect my output to be like as below
ID,Name,Sub,Country, Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
The expected output corresponds to matching by 3 columns:
import numpy as np

m = df1.merge(df2,
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'],
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
A solution that tests membership with isin:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Note that matching by only 2 columns gives a different output:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
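Either mask can then pull out just the df1 rows with no match in df2, so the large df2 is only ever used as a lookup and never filtered or negated. A self-contained sketch on a trimmed version of the data above:

```python
import io
import numpy as np
import pandas as pd

df1 = pd.read_csv(io.StringIO(
    "ID,Name,Sub,Country\n1,ABC,ENG,UK\n1,ABC,MATHS,UK\n1,ABC,Science,UK"))
df2 = pd.read_csv(io.StringIO(
    "ID,Name,class,age\n11,ABC,ENG,21\n12,ABC,MATHS,23\n1,ABC,Science,25"))

# Membership is tested per row of df1 only; df2's size does not
# inflate the result the way a negated filter on df2 would.
idx1 = pd.MultiIndex.from_frame(df1[['ID', 'Name', 'Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID', 'Name', 'class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')

# The df1 rows that are absent from df2
missing = df1[df1['Status'].eq('No')]
```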
DATASET I CURRENTLY HAVE-
COUNTRY city id tag dummy
India ackno 1 2 1
China open 0 0 1
India ackno 1 2 1
China open 0 0 1
USA open 0 0 1
USA open 0 0 1
China ackno 1 2 1
USA ackno 1 2 1
USA resol 1 0 1
Russia open 0 0 1
Italy open 0 0 1
country=df['COUNTRY'].unique().tolist()
city=['resol','ackno']
#below are the preferred filters for calculating column percentage
df_looped=df[(df['city'].isin(city)) & (df['id']!=0) | (df['tag']!=0)]
percentage=(df_looped/df)*100
df_summed=df.groupby(['COUNTRY']).agg({'COUNTRY':'count'})
summed=df_summed['COUNTRY'].sum(axis=0)
THE DATASET I WANT-
COUNTRY percentage summed
India 100% 2
China 66.66% 3
USA 25% 4
Russia 0% 1
Italy 0% 1
The percentage should be derived from the above formula for every unique country, and likewise for the sum.
The percentage and summed variables should populate the columns.
You can create a helper column a from your conditions. For the percentage of True values use mean; for counting rows use GroupBy.size (GroupBy.count omits missing values, and here there are none); finally scale the percentage to 100:
city=['resol','ackno']
df = (df.assign(a = (df['city'].isin(city) & (df['id']!=0) | (df['tag']!=0)))
.groupby('COUNTRY', sort=False)
.agg(percentage= ('a','mean'),summed=('a', 'size'))
.assign(percentage = lambda x: x['percentage'].mul(100).round(2))
)
print (df)
percentage summed
COUNTRY
India 100.00 2
China 33.33 3
USA 50.00 4
Russia 0.00 1
Italy 0.00 1
You can use pivot_table with a dict of functions to apply to your dataframe. First assign a new column holding your conditions (looped):
funcs = {
'looped': [
('percentage', lambda x: f"{round(sum(x) / len(x) * 100, 2)}%"),
('summed', 'size')
]
}
# Your code without df[...]
looped = (df['city'].isin(city)) & (df['id'] != 0) | (df['tag'] != 0)
out = df.assign(looped=looped).pivot_table('looped', 'COUNTRY', aggfunc=funcs)
Output:
>>> out
percentage summed
COUNTRY
China 33.33% 3
India 100.0% 2
Italy 0.0% 1
Russia 0.0% 1
USA 50.0% 4
import re

Matches = 0
for country in df1['country']:
    for street, City in zip(df2.country, df2.City):
        if re.match(r'[A-Za-z]+\:' + street + r'\.' + City, country):
            s = re.match(r'[A-Za-z]+\:' + street + r'\.' + TR +
                         r'\_(VS).+', country)
            Matches += 1
            print(s)
print(Matches)
df1:
UID country
0 1 Gervais Philippon:France.PARISPenthièvre25
1 2 Jed Turner:England.LONDONQueensway69
2 3 Lino Jimenez:Spain.MADRIDChavela33
df2:
UID country City
0 1 France PARIS
1 2 Spain MADRID
2 3 England LONDON
Expected output:
UID country UID_df2
0 1 Gervais Philippon:France.PARISPenthièvre25 1
1 2 Jed Turner:England.LONDONQueensway69 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 2
The matches are shown correctly. How can I link the dataframes by assigning the matched UID from df2 to df1, as in the ideal format above?
Thank you.
First, I would rename country in df1 to data (or something else) so it doesn't get confused with country in df2:
df1 = df1.rename(columns={'country': 'data'})
Get the country and City data
df1[['country', 'City']] = df1['data'].str.extract('(:([A-Z]+[a-z]*)).([A-Z]+)', expand=True)[[1, 2]]
Trim the trailing character captured in City; this step could be removed by refining the regex above:
df1['City'] = df1['City'].map(lambda x: x[:-1])
Finally merge with df2
df1.merge(df2, on=['country', 'City'])
UID_x data country City UID_y
0 1 Gervais Philippon:France.PARISPenthièvre25 France PARIS 1
1 2 Jed Turner:England.LONDONQueensway69 England LONDON 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 Spain MADRID 2
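One caveat: the default inner merge silently drops df1 rows whose country/City pair has no match in df2. If unmatched rows should survive, how='left' keeps them with NaN in the df2 UID (a sketch with hypothetical minimal frames; the suffixes argument only renames the colliding UID column):

```python
import pandas as pd

# Hypothetical frames: the second df1 row has no counterpart in df2
df1 = pd.DataFrame({'UID': [1, 2],
                    'country': ['France', 'Atlantis'],
                    'City': ['PARIS', 'NOWHERE']})
df2 = pd.DataFrame({'UID': [1], 'country': ['France'], 'City': ['PARIS']})

# how='left' keeps unmatched df1 rows; UID_df2 is NaN where nothing matched
out = df1.merge(df2, on=['country', 'City'], how='left', suffixes=('', '_df2'))
```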
I am working on the IPL dataset, which has many categorical variables; one such variable is toss_winner. I have created dummy variables for it and now have 15 columns with binary values. I want to merge all these columns into a single column with numbers 0-14, each number representing an IPL team.
IIUC, Use:
df['Team No.'] = dummies.cumsum(axis=1).ne(1).sum(axis=1)
Example,
df = pd.DataFrame({'Toss winner': ['Chennai', 'Mumbai', 'Rajasthan', 'Banglore', 'Hyderabad']})
dummies = pd.get_dummies(df['Toss winner'])
df['Team No.'] = dummies.cumsum(axis=1).ne(1).sum(axis=1)
Result:
# print(dummies)
Banglore Chennai Hyderabad Mumbai Rajasthan
0 0 1 0 0 0
1 0 0 0 1 0
2 0 0 0 0 1
3 1 0 0 0 0
4 0 0 1 0 0
# print (df)
Toss winner Team No.
0 Chennai 1
1 Mumbai 3
2 Rajasthan 4
3 Banglore 0
4 Hyderabad 2
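An alternative that skips the cumsum trick: since get_dummies produces exactly one 1 per row, the column position can be read off directly with argmax, giving the same alphabetical numbering (a sketch on the same example data):

```python
import pandas as pd

df = pd.DataFrame({'Toss winner': ['Chennai', 'Mumbai', 'Rajasthan',
                                   'Banglore', 'Hyderabad']})
dummies = pd.get_dummies(df['Toss winner'])

# Each row has a single 1, so argmax returns its column position,
# i.e. the team's index in the alphabetically sorted column list
df['Team No.'] = dummies.values.argmax(axis=1)
```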
I am trying to produce a report, and I run the below code:
import pandas as pd
df = pd.read_excel("proposals2020.xlsx", sheet_name="Proposals")
country_probability = df.groupby(["Country", "Probability"]).count()
country_probability = country_probability.unstack()
country_probability = country_probability.fillna("0")
country_probability = country_probability.drop(country_probability.columns[4:], axis=1)
country_probability = country_probability.drop(country_probability.columns[0], axis=1)
country_probability = country_probability.astype(int)
print(country_probability)
I get the below results:
Quote Number
Probability High Low Medium
Country
Algeria 3 1 9
Bahrain 4 3 2
Egypt 2 0 3
Iraq 3 0 8
Jordan 0 1 1
Lebanon 0 1 0
Libya 1 0 0
Morocco 0 0 2
Pakistan 3 10 11
Qatar 0 1 1
Saudi Arabia 16 8 19
Tunisia 2 5 0
USA 0 1 0
My question is how to stop pandas from sorting these columns alphabetically and keep the High, Medium, Low order...
Use DataFrame.reindex:
# if isinstance(df.columns, pd.MultiIndex)
df = df.reindex(['High', 'Medium', 'Low'], axis=1, level=1)
If not MultiIndex in columns:
# if isinstance(df.columns, pd.Index)
df = df.reindex(['High', 'Medium', 'Low'], axis=1)
We can also try passing sort=False to groupby:
country_probability = df.groupby(["Country", "Probability"], sort=False).count()
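Another option is to declare Probability as an ordered Categorical before grouping, so downstream groupby/unstack keeps the High/Medium/Low order automatically instead of sorting alphabetically (a sketch with a hypothetical minimal frame standing in for the Excel data):

```python
import pandas as pd

# Hypothetical stand-in for the proposals data
df = pd.DataFrame({'Country': ['Algeria', 'Algeria', 'Bahrain'],
                   'Probability': ['Low', 'High', 'Medium'],
                   'Quote Number': ['Q1', 'Q2', 'Q3']})

# Ordered categorical: sorting now follows this order, not the alphabet
df['Probability'] = pd.Categorical(df['Probability'],
                                   categories=['High', 'Medium', 'Low'],
                                   ordered=True)

out = df.groupby(['Country', 'Probability']).count().unstack().fillna(0)
```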