Use pandas to mark cell X if id in country - python

To start with, I have 3 Excel files: canada.xlsx, mexico.xlsx and usa.xlsx. Each has 3 columns: id (a number), ColA (a text like Val1) and Country. Each Excel file has only the country of its name in the third column, e.g. only Canada in canada.xlsx.
I make a df:
import pandas as pd
import glob
savepath = '/home/pedro/myPython/pandas/xl_files/'
saveoutputpath = '/home/pedro/myPython/pandas/outputxl/'
# I put an extra column in each excel file named country with either Canada, Mexico or USA
filelist = glob.glob(savepath + "*.xlsx")
# open the xl files with the data
# put all the data in 1 df
df = pd.concat((pd.read_excel(f) for f in filelist))
# change the indexes to get unique indexes
# df.index.size gets how many indexes there are
indexes = []
for i in range(df.index.size):
    indexes.append(i)
# now change the indexes pass a list to df.index
# never good to have 2 indexes the same
df.index = indexes
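(As an aside, the manual reindexing above can be done in one step; a minimal sketch, assuming the same filelist:)
# ignore_index=True renumbers the combined rows 0..n-1 during the concat
df = pd.concat((pd.read_excel(f) for f in filelist), ignore_index=True)
# equivalently, after a plain concat:
# df = df.reset_index(drop=True)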
Then I make the output Excel, which has 4 columns: id, Canada, Mexico, USA. The point of the exercise is to write an X in each country column for a corresponding id number; for example, id 42345 may be in both Canada and Mexico, so 42345 should get an X in those 2 columns.
I made this work, but only by extracting the data from df to a dictionary first. I tried various ways of doing this with df.loc[] or df.iloc[] but I can't seem to make it work. I don't use pandas much.
This is how I make the output df_out:
# get a list of the ids
mylist = df["id"].values.tolist()
# get a set of the unique ids
myset = set(mylist)
# create a new DataFrame with unique values in the column id
df_out = pd.DataFrame(columns=['id', 'Canada', 'Mexico', 'USA'], index=range(0, len(myset)))
df_out.fillna(0, inplace=True)
# make a list of unique ids and sort them
id_names = list(myset)
id_names.sort()
# populate the id column with id_names
df_out["id"] = id_names
# see how many rows and columns
print(df_out.shape)
# mydict was built earlier from df (not shown): mydict[key][0] is the id, mydict[key][2] is the country
for key in mydict.keys():
    df_out.loc[df_out["id"] == mydict[key][0], mydict[key][2]] = "X"
Can you help me with a more "pandas way" of writing the X in df_out directly from df?
df:
id Col A country
0 42345 Test 1 USA
1 681593 Test 2 USA
2 331574 Test 3 USA
3 15786 Test 4 USA
4 93512 Chk1 Mexico
5 681593 Chk2 Mexico
6 331574 Chk3 Mexico
7 89153 Chk4 Mexico
8 42345 Val1 Canada
9 93512 Val2 Canada
10 331574 Val3 Canada
11 76543 Val4 Canada
df_out:
id Canada Mexico USA
0 15786 0 0 0
1 42345 0 0 0
2 76543 0 0 0
3 89153 0 0 0
4 93512 0 0 0
5 331574 0 0 0
6 681593 0 0 0

What you want is a pivot table.
(pd.pivot_table(df, index='id', columns='country', aggfunc=lambda z: 'X', fill_value=0)
   .rename_axis(None, axis=1)
   .reset_index())
Input
id country
0 42345 USA
1 681593 USA
2 331574 USA
3 15786 USA
4 93512 Mexico
5 681593 Mexico
6 331574 Mexico
7 89153 Mexico
8 42345 Canada
9 93512 Canada
10 331574 Canada
11 76543 Canada
Output
id Canada Mexico USA
0 15786 0 0 X
1 42345 X 0 X
2 76543 X 0 0
3 89153 0 X 0
4 93512 X X 0
5 331574 X X X
6 681593 0 X X
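An equivalent sketch with pd.crosstab, if you find it more readable (assuming the same df and column names as above):
# crosstab counts (id, country) pairs; gt(0) flags the pairs that occur,
# and replace maps the booleans onto the X/0 convention from the question
out = (pd.crosstab(df['id'], df['country'])
         .gt(0)
         .replace({True: 'X', False: 0})
         .rename_axis(None, axis=1)
         .reset_index())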

Related

Pandas filter without ~ and not in operator

I have two dataframes, as below:
ID,Name,Sub,Country
1,ABC,ENG,UK
1,ABC,MATHS,UK
1,ABC,Science,UK
2,ABE,ENG,USA
2,ABE,MATHS,USA
2,ABE,Science,USA
3,ABF,ENG,IND
3,ABF,MATHS,IND
3,ABF,Science,IND
df1 = pd.read_clipboard(sep=',')
ID,Name,class,age
11,ABC,ENG,21
12,ABC,MATHS,23
1,ABC,Science,25
22,ABE,ENG,19
23,ABE,MATHS,22
24,ABE,Science,26
33,ABF,ENG,24
31,ABF,MATHS,28
32,ABF,Science,26
df2 = pd.read_clipboard(sep=',')
I would like to do the following:
a) Check whether the ID and Name from df1 is present in df2.
b) If present in df2, put Yes in a Status column, otherwise No. Don't use ~ or the not in operator, because my df2 has millions of rows, so it will produce irrelevant results.
I tried the below
ID_list = df1['ID'].unique().tolist()
Name_list = df1['Name'].unique().tolist()
filtered_df = df2[((df2['ID'].isin(ID_list)) & (df2['Name'].isin(Name_list)))]
filtered_df = filtered_df.groupby(['ID','Name','Sub']).size().reset_index()
The above code gives the matching ids and names between df1 and df2.
But I want to find the ids and names that are present in df1 but missing from df2. I cannot use the ~ operator because it would return all the rows from df2 that don't have a match in df1. In the real world, my df2 has millions of rows. I only want to find the missing df1 ids and names and flag them in a Status column.
I expect my output to be as below:
ID,Name,Sub,Country,Status
1,ABC,ENG,UK,No
1,ABC,MATHS,UK,No
1,ABC,Science,UK,Yes
2,ABE,ENG,USA,No
2,ABE,MATHS,USA,No
2,ABE,Science,USA,No
3,ABF,ENG,IND,No
3,ABF,MATHS,IND,No
3,ABF,Science,IND,No
The expected output corresponds to matching by 3 columns:
import numpy as np

m = df1.merge(df2,
              left_on=['ID','Name','Sub'],
              right_on=['ID','Name','class'],
              indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
An alternative solution, testing membership with isin:
idx1 = pd.MultiIndex.from_frame(df1[['ID','Name','Sub']])
idx2 = pd.MultiIndex.from_frame(df2[['ID','Name','class']])
df1['Status'] = np.where(idx1.isin(idx2), 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK No
1 1 ABC MATHS UK No
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
Note that matching by only 2 columns gives a different output:
m = df1.merge(df2, on=['ID','Name'], indicator=True, how='left')['_merge'].eq('both')
df1['Status'] = np.where(m, 'Yes', 'No')
print (df1)
ID Name Sub Country Status
0 1 ABC ENG UK Yes
1 1 ABC MATHS UK Yes
2 1 ABC Science UK Yes
3 2 ABE ENG USA No
4 2 ABE MATHS USA No
5 2 ABE Science USA No
6 3 ABF ENG IND No
7 3 ABF MATHS IND No
8 3 ABF Science IND No
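One caveat worth noting: with how='left', duplicate key rows in df2 would multiply rows in df1 and misalign the mask. A defensive sketch, assuming the same column names:
# drop duplicate keys in df2 first so the left join cannot add rows,
# keeping the mask aligned with df1 row for row
keys = df2[['ID', 'Name', 'class']].drop_duplicates()
m = (df1.merge(keys, left_on=['ID', 'Name', 'Sub'],
               right_on=['ID', 'Name', 'class'],
               indicator=True, how='left')['_merge'].eq('both'))
df1['Status'] = np.where(m, 'Yes', 'No')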

Filtered group: how to group by one column and add calculated columns

DATASET I CURRENTLY HAVE-
COUNTRY city id tag dummy
India ackno 1 2 1
China open 0 0 1
India ackno 1 2 1
China open 0 0 1
USA open 0 0 1
USA open 0 0 1
China ackno 1 2 1
USA ackno 1 2 1
USA resol 1 0 1
Russia open 0 0 1
Italy open 0 0 1
country = df['COUNTRY'].unique().tolist()
city = ['resol', 'ackno']
# below are the preferred filters for calculating the percentage column
df_looped = df[(df['city'].isin(city)) & (df['id'] != 0) | (df['tag'] != 0)]
percentage = (len(df_looped) / len(df)) * 100
df_summed = df.groupby(['COUNTRY']).agg({'COUNTRY': 'count'})
summed = df_summed['COUNTRY'].sum(axis=0)
THE DATASET I WANT-
COUNTRY percentage summed
india 100% 2
China 66.66% 3
USA 25% 4
Russia 0% 1
Italy 0% 1
percentage should be derived from the above formula for every unique country, and the same for the sum. The percentage and summed variables should populate the columns.
You can create a helper column a from your conditions; for the percentage of True values use mean, to count values use GroupBy.size (GroupBy.count omits missing values, and there are none here), and last scale the percentages:
city = ['resol', 'ackno']
df = (df.assign(a=(df['city'].isin(city) & (df['id'] != 0) | (df['tag'] != 0)))
        .groupby('COUNTRY', sort=False)
        .agg(percentage=('a', 'mean'), summed=('a', 'size'))
        .assign(percentage=lambda x: x['percentage'].mul(100).round(2))
      )
print (df)
percentage summed
COUNTRY
India 100.00 2
China 33.33 3
USA 50.00 4
Russia 0.00 1
Italy 0.00 1
You can use pivot_table with a dict of functions to apply to your dataframe. You first have to assign a new column with your conditions (looped):
funcs = {
    'looped': [
        ('percentage', lambda x: f"{round(sum(x) / len(x) * 100, 2)}%"),
        ('summed', 'size')
    ]
}
# Your code without df[...]
looped = (df['city'].isin(city)) & (df['id'] != 0) | (df['tag'] != 0)
out = df.assign(looped=looped).pivot_table('looped', 'COUNTRY', aggfunc=funcs)
Output:
>>> out
percentage summed
COUNTRY
China 33.33% 3
India 100.0% 2
Italy 0.0% 1
Russia 0.0% 1
USA 50.0% 4
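A third variant, if you only need the two output columns: named aggregation on the helper series directly (a sketch, reusing the looped mask from above):
out = (df.assign(looped=looped)
         .groupby('COUNTRY', sort=False)['looped']
         .agg(percentage=lambda s: f"{round(s.mean() * 100, 2)}%",  # share of True rows
              summed='size')                                        # rows per country
         .reset_index())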

Comparing two dataframes on strings: how to assign the matched strings to one of the dataframes in the right row and column

import re

Matches = 0
for country in df1['country']:
    for street, City in zip(df2.street, df2.City):  # df2.street and TR below come from my fuller code
        if re.match(r'[A-Za-z]+\:' + street + r'\.' + City, country):
            s = re.match(r'[A-Za-z]+\:' + street + r'\.' + TR + r'\_(VS).+', country)
            Matches += 1
            print(s)
print(Matches)
df1:
UID country
0 1 Gervais Philippon:France.PARISPenthièvre25
1 2 Jed Turner:England.LONDONQueensway69
2 3 Lino Jimenez:Spain.MADRIDChavela33
df2:
UID country City
0 1 France PARIS
1 2 Spain MADRID
2 3 England LONDON
Expected output:
UID country UID_df2
0 1 Gervais Philippon:France.PARISPenthièvre25 1
1 2 Jed Turner:England.LONDONQueensway69 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 2
The matches are shown correctly. How can I link the dataframes by assigning the matched string to the other dataframe, in the ideal format shown above? Thank you.
First I would rename country in df1 to data, or something else, so it doesn't get confused with country in df2:
df1 = df1.rename(columns={'country': 'data'})
Get the country and City data
df1[['country', 'City']] = df1['data'].str.extract('(:([A-Z]+[a-z]*)).([A-Z]+)', expand=True)[[1, 2]]
Strip the extra trailing capital from the City name; this step can be removed by updating the regex above:
df1['City'] = df1['City'].map(lambda x: x[:-1])
Finally merge with df2
df1.merge(df2, on=['country', 'City'])
UID_x data country City UID_y
0 1 Gervais Philippon:France.PARISPenthièvre25 France PARIS 1
1 2 Jed Turner:England.LONDONQueensway69 England LONDON 3
2 3 Lino Jimenez:Spain.MADRIDChavela33 Spain MADRID 2
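For the step the answer says can be removed, a sketch of a tightened regex (assuming the City is an all-caps run immediately followed by a capitalized street name):
# the lookahead stops the City match before the street name starts,
# so the trailing-letter cleanup is no longer needed
df1[['country', 'City']] = df1['data'].str.extract(r':([A-Za-z]+)\.([A-Z]+)(?=[A-Z][a-z])')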

How to merge multiple dummy variables columns which were created from a single categorical variable into single column in python?

I am working on the IPL dataset, which has many categorical variables; one such variable is toss_winner. I have created dummy variables for it, and now I have 15 columns with binary values. I want to merge all these columns into a single column with the numbers 0-14, each number representing an IPL team.
IIUC, use:
df['Team No.'] = dummies.cumsum(axis=1).ne(1).sum(axis=1)
Example:
df = pd.DataFrame({'Toss winner': ['Chennai', 'Mumbai', 'Rajasthan', 'Banglore', 'Hyderabad']})
dummies = pd.get_dummies(df['Toss winner'])
df['Team No.'] = dummies.cumsum(axis=1).ne(1).sum(axis=1)
Result:
# print(dummies)
Banglore Chennai Hyderabad Mumbai Rajasthan
0 0 1 0 0 0
1 0 0 0 1 0
2 0 0 0 0 1
3 1 0 0 0 0
4 0 0 1 0 0
# print (df)
Toss winner Team No.
0 Chennai 1
1 Mumbai 3
2 Rajasthan 4
3 Banglore 0
4 Hyderabad 2
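An equivalent sketch without building dummies at all: categorical codes number the alphabetically sorted categories the same way (assuming no fixed team order is required):
# categories default to sorted order, matching the dummy-column order above
df['Team No.'] = df['Toss winner'].astype('category').cat.codes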

Stop Pandas from sorting columns

I am trying to produce a report, so I run the code below:
import pandas as pd
df = pd.read_excel("proposals2020.xlsx", sheet_name="Proposals")
country_probability = df.groupby(["Country", "Probability"]).count()
country_probability = country_probability.unstack()
country_probability = country_probability.fillna("0")
country_probability = country_probability.drop(country_probability.columns[4:], axis=1)
country_probability = country_probability.drop(country_probability.columns[0], axis=1)
country_probability = country_probability.astype(int)
print(country_probability)
I get the below results:
Quote Number
Probability High Low Medium
Country
Algeria 3 1 9
Bahrain 4 3 2
Egypt 2 0 3
Iraq 3 0 8
Jordan 0 1 1
Lebanon 0 1 0
Libya 1 0 0
Morocco 0 0 2
Pakistan 3 10 11
Qatar 0 1 1
Saudi Arabia 16 8 19
Tunisia 2 5 0
USA 0 1 0
My question is: how do I stop pandas from sorting these columns alphabetically and keep the High, Medium, Low order?
Use DataFrame.reindex:
# if isinstance(df.columns, pd.MultiIndex)
df = df.reindex(['High', 'Medium', 'Low'], axis=1, level=1)
If the columns are not a MultiIndex:
# if isinstance(df.columns, pd.Index)
df = df.reindex(['High', 'Medium', 'Low'], axis=1)
We can also try passing sort=False to groupby:
country_probability = df.groupby(["Country", "Probability"], sort=False).count()
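Another option is to make Probability an ordered categorical before grouping, so the unstacked columns come out in category order; a sketch assuming the same column names:
# with an ordered categorical, unstack emits the columns in category order
df["Probability"] = pd.Categorical(df["Probability"],
                                   categories=["High", "Medium", "Low"],
                                   ordered=True)
country_probability = df.groupby(["Country", "Probability"]).count().unstack()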
