I am trying to produce a report, so I run the code below:
import pandas as pd
df = pd.read_excel("proposals2020.xlsx", sheet_name="Proposals")
country_probability = df.groupby(["Country", "Probability"]).count()
country_probability = country_probability.unstack()
country_probability = country_probability.fillna("0")
country_probability = country_probability.drop(country_probability.columns[4:], axis=1)
country_probability = country_probability.drop(country_probability.columns[0], axis=1)
country_probability = country_probability.astype(int)
print(country_probability)
I get the below results:
Quote Number
Probability High Low Medium
Country
Algeria 3 1 9
Bahrain 4 3 2
Egypt 2 0 3
Iraq 3 0 8
Jordan 0 1 1
Lebanon 0 1 0
Libya 1 0 0
Morocco 0 0 2
Pakistan 3 10 11
Qatar 0 1 1
Saudi Arabia 16 8 19
Tunisia 2 5 0
USA 0 1 0
My question is how to stop pandas from sorting these columns alphabetically and keep the High, Medium, Low order...
Use DataFrame.reindex to set the column order explicitly:
# if isinstance(df.columns, pd.MultiIndex)
df = df.reindex(['High', 'Medium', 'Low'], axis=1, level=1)
If the columns are not a MultiIndex:
# if isinstance(df.columns, pd.Index)
df = df.reindex(['High', 'Medium', 'Low'], axis=1)
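We can also try passing sort=False to groupby, which keeps the groups in order of first appearance instead of sorting them. Note that this follows the order of the data, which is not necessarily High, Medium, Low, and a later unstack may sort the columns again: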
country_probability = df.groupby(["Country", "Probability"], sort=False).count()
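As a minimal sketch of the reindex approach, assuming a small made-up sample in place of proposals2020.xlsx (the rows are hypothetical; only the column names come from the question): selecting the Quote Number column before counting gives single-level columns, which can then be put in an explicit order.

```python
import pandas as pd

# Hypothetical sample standing in for the Excel sheet in the question
df = pd.DataFrame({
    "Country": ["Algeria", "Algeria", "Algeria", "Bahrain", "Bahrain"],
    "Probability": ["High", "Medium", "Medium", "Low", "High"],
    "Quote Number": ["Q1", "Q2", "Q3", "Q4", "Q5"],
})

# Count quotes per (Country, Probability); unstack sorts the labels alphabetically
counts = (
    df.groupby(["Country", "Probability"])["Quote Number"]
      .count()
      .unstack(fill_value=0)
)

# Impose the desired column order; a label missing from the data is filled with 0
counts = counts.reindex(["High", "Medium", "Low"], axis=1, fill_value=0)
print(counts)
```

Counting a single column also removes the need to drop the extra columns afterwards, since only Quote Number is aggregated.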
To start with, I have 3 Excel files: canada.xlsx, mexico.xlsx and usa.xlsx. Each has 3 columns: id (a number), ColA (text such as Val1) and Country. Each file contains only its own country in the third column, for example only Canada in canada.xlsx.
I make a df:
import pandas as pd
import glob
savepath = '/home/pedro/myPython/pandas/xl_files/'
saveoutputpath = '/home/pedro/myPython/pandas/outputxl/'
# I put an extra column in each excel file named country with either Canada, Mexico or USA
filelist = glob.glob(savepath + "*.xlsx")
# open the xl files with the data
# put all the data in 1 df
df = pd.concat((pd.read_excel(f) for f in filelist))
# change the indexes to get unique indexes
# df.index.size gets how many indexes there are
indexes = []
for i in range(df.index.size):
    indexes.append(i)
# now change the indexes pass a list to df.index
# never good to have 2 indexes the same
df.index = indexes
I then make the output Excel file, which has 4 columns: id, Canada, Mexico, USA. The point of the exercise is to write an X in each country column where the id occurs. For example, id 42345 may be in both Canada and Mexico, so 42345 should get an X in those 2 columns.
I made this work, but only by extracting the data from df to a dictionary. I tried various ways of doing this with df.loc[] or df.iloc[] but couldn't get it to work. I don't use pandas much.
This is the output df_out
# get a list of the ids
mylist = df["id"].values.tolist()
# get a set of the unique ids
myset = set(mylist)
#create new DataFrame with unique values in the column id
df_out = pd.DataFrame(columns=['id', 'Canada', 'Mexico', 'USA'], index=range(0, len(myset)))
df_out.fillna(0, inplace=True)
# make a list of unique ids and sort them
id_names = list(myset)
id_names.sort()
# populate the id column with id_names
df_out["id"] = id_names
# see how many rows and columns
print(df_out.shape)
# mydict[key][0] is the id column, mydict[key][2] is the country
for key in mydict.keys():
    df_out.loc[df_out["id"] == mydict[key][0], mydict[key][2]] = "X"
Can you help me with a more "pandas way" of writing the X in df_out directly from df?
df:
id Col A country
0 42345 Test 1 USA
1 681593 Test 2 USA
2 331574 Test 3 USA
3 15786 Test 4 USA
4 93512 Chk1 Mexico
5 681593 Chk2 Mexico
6 331574 Chk3 Mexico
7 89153 Chk4 Mexico
8 42345 Val1 Canada
9 93512 Val2 Canada
10 331574 Val3 Canada
11 76543 Val4 Canada
df_out:
id Canada Mexico USA
0 15786 0 0 0
1 42345 0 0 0
2 76543 0 0 0
3 89153 0 0 0
4 93512 0 0 0
5 331574 0 0 0
6 681593 0 0 0
What you want is a pivot table.
pd.pivot_table(df, index='id', columns='country', aggfunc=lambda z: 'X', fill_value=0).rename_axis(None, axis=1).reset_index()
Input
id country
0 42345 USA
1 681593 USA
2 331574 USA
3 15786 USA
4 93512 Mexico
5 681593 Mexico
6 331574 Mexico
7 89153 Mexico
8 42345 Canada
9 93512 Canada
10 331574 Canada
11 76543 Canada
Output
id Canada Mexico USA
0 15786 0 0 X
1 42345 X 0 X
2 76543 X 0 0
3 89153 0 X 0
4 93512 X X 0
5 331574 X X X
6 681593 0 X X
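As an alternative sketch, assuming the same id and country columns as the Input above, pd.crosstab counts how often each id occurs per country, and any positive count can then be mapped to "X". The intermediate name counts is only illustrative.

```python
import pandas as pd

# Same id/country pairs as the Input shown above
df = pd.DataFrame({
    "id": [42345, 681593, 331574, 15786,
           93512, 681593, 331574, 89153,
           42345, 93512, 331574, 76543],
    "country": ["USA"] * 4 + ["Mexico"] * 4 + ["Canada"] * 4,
})

# Rows are the unique ids (sorted), columns are the countries, values are counts
counts = pd.crosstab(df["id"], df["country"])

# Turn every positive count into "X" and every zero into 0
df_out = (
    counts.gt(0)
          .apply(lambda col: col.map({True: "X", False: 0}))
          .rename_axis(None, axis=1)
          .reset_index()
)
print(df_out)
```

Because crosstab only needs the two key columns, extra columns such as Col A do not have to be dropped first.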
I have the following dataframe:
import pandas as pd
cars = ["BMV", "Mercedes", "Audi"]
customer = ["Juan", "Pepe", "Luis"]
price = [100, 200, 300]
year = [2022, 2021, 2020]
df_raw = pd.DataFrame(list(zip(cars, customer, price, year)),
                      columns=["cars", "customer", "price", "year"])
I need to do one-hot encoding from the categorical variables cars and customer, for this I use the get_dummies method for these two columns.
numerical = ["price", "year"]
df_final = pd.concat([df_raw[numerical], pd.get_dummies(df_raw.cars),
                      pd.get_dummies(df_raw.customer)], axis=1)
Is there a way to generate these dummies dynamically, for example by putting the column names in a list and looping over them with a for loop? In this case it may seem simple because I only have 2, but if I had 30 or 60 attributes, would I have to go one by one?
Pass the list of columns to pd.get_dummies through its columns parameter:
pd.get_dummies(df_raw, columns=['cars', 'customer'])
price year cars_Audi cars_BMV cars_Mercedes customer_Juan customer_Luis customer_Pepe
0 100 2022 0 1 0 1 0 0
1 200 2021 0 0 1 0 0 1
2 300 2020 1 0 0 0 1 0
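With 30 or 60 attributes, the column list itself can be derived instead of typed out. The sketch below assumes every non-numeric column should be encoded, which may not hold for every dataset; the name categorical is only illustrative. Passing dtype=int keeps the 0/1 output shown above, since newer pandas versions return booleans by default.

```python
import pandas as pd

df_raw = pd.DataFrame({
    "cars": ["BMV", "Mercedes", "Audi"],
    "customer": ["Juan", "Pepe", "Luis"],
    "price": [100, 200, 300],
    "year": [2022, 2021, 2020],
})

# Assumption: every non-numeric column is a categorical to be one-hot encoded
categorical = df_raw.select_dtypes(exclude="number").columns.tolist()

# Numeric columns are kept as they are; each categorical becomes prefixed dummies
df_final = pd.get_dummies(df_raw, columns=categorical, dtype=int)
print(df_final)
```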
One simple way is to concatenate the columns and use str.get_dummies:
cols = ['cars', 'customer']
out = df_raw.join(df_raw[cols].agg('|'.join, axis=1).str.get_dummies())
output:
cars customer price year Audi BMV Juan Luis Mercedes Pepe
0 BMV Juan 100 2022 0 1 1 0 0 0
1 Mercedes Pepe 200 2021 0 0 0 0 1 1
2 Audi Luis 300 2020 1 0 0 1 0 0
Another option is to melt and use crosstab:
df2 = df_raw[cols].reset_index().melt('index')
out = df_raw.join(pd.crosstab(df2['index'], df2['value']))
The dataset I currently have:
COUNTRY city id tag dummy
India ackno 1 2 1
China open 0 0 1
India ackno 1 2 1
China open 0 0 1
USA open 0 0 1
USA open 0 0 1
China ackno 1 2 1
USA ackno 1 2 1
USA resol 1 0 1
Russia open 0 0 1
Italy open 0 0 1
country=df['COUNTRY'].unique().tolist()
city=['resol','ackno']
#below are the preferred filters for calculating column percentage
df_looped=df[(df['city'].isin(city)) & (df['id']!=0) | (df['tag']!=0)]
percentage=(df_looped/df)*100
df_summed=df.groupby(['COUNTRY']).agg({'COUNTRY':'count'})
summed=df_summed['COUNTRY'].sum(axis=0)
The dataset I want:
COUNTRY percentage summed
India 100% 2
China 66.66% 3
USA 25% 4
Russia 0% 1
Italy 0% 1
The percentage should be derived from the formula above for every unique country, and likewise for the sum.
The percentage and summed variables should populate the corresponding columns.
You can create a helper column a from your conditions. For the percentage of True values use mean, and for the row counts use GroupBy.size (GroupBy.count omits missing values, but there are none here). Finally, multiply the percentage by 100 and round it:
city=['resol','ackno']
df = (df.assign(a = (df['city'].isin(city) & (df['id']!=0) | (df['tag']!=0)))
        .groupby('COUNTRY', sort=False)
        .agg(percentage=('a', 'mean'), summed=('a', 'size'))
        .assign(percentage = lambda x: x['percentage'].mul(100).round(2))
     )
print (df)
percentage summed
COUNTRY
India 100.00 2
China 33.33 3
USA 50.00 4
Russia 0.00 1
Italy 0.00 1
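For reference, here is a self-contained sketch of the same idea that rebuilds the sample data from the question, so the numbers above can be reproduced. The names mask and out are only illustrative. Note that `A & B | C` is evaluated as `(A & B) | C`, which the explicit parentheses below make visible.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "COUNTRY": ["India", "China", "India", "China", "USA", "USA",
                "China", "USA", "USA", "Russia", "Italy"],
    "city": ["ackno", "open", "ackno", "open", "open", "open",
             "ackno", "ackno", "resol", "open", "open"],
    "id": [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0],
    "tag": [2, 0, 2, 0, 0, 0, 2, 2, 0, 0, 0],
})
city = ["resol", "ackno"]

# True where a row passes the filter from the question
mask = (df["city"].isin(city) & (df["id"] != 0)) | (df["tag"] != 0)

out = (
    df.assign(a=mask)
      .groupby("COUNTRY", sort=False)
      .agg(percentage=("a", "mean"), summed=("a", "size"))
      .assign(percentage=lambda x: x["percentage"].mul(100).round(2))
)
print(out)
```

The mean of a boolean column is the share of True values, which is why no separate division by the group size is needed.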
You can use pivot_table with a dict of functions to apply to your dataframe. First assign a new column (looped) that holds your conditions:
funcs = {
    'looped': [
        ('percentage', lambda x: f"{round(sum(x) / len(x) * 100, 2)}%"),
        ('summed', 'size')
    ]
}
# Your code without df[...]
looped = (df['city'].isin(city)) & (df['id'] != 0) | (df['tag'] != 0)
out = df.assign(looped=looped).pivot_table('looped', 'COUNTRY', aggfunc=funcs)
Output:
>>> out
percentage summed
COUNTRY
China 33.33% 3
India 100.0% 2
Italy 0.0% 1
Russia 0.0% 1
USA 50.0% 4
I have a dataframe in Pandas with some columns, something like this:
data = {
'CODIGO_SINIESTRO': [10476434, 10476434, 4482524, 4482524, 4486110],
'CONDICION': ['PASAJERO', 'CONDUCTOR', 'MOTOCICLISTA', 'CICLISTA', 'PEATON'],
'EDAD': [62.0, 29.0, 26.0, 47.0, 33.0],
'SEXO': ['MASCULINO', 'FEMENINO', 'FEMENINO', 'MASCULINO', 'FEMENINO']
}
df = pd.DataFrame(data)
Output:
CODIGO_SINIESTRO CONDICION EDAD SEXO
0 10476434 PASAJERO 62.0 MASCULINO
1 10476434 CONDUCTOR 29.0 FEMENINO
2 4482524 MOTOCICLISTA 26.0 FEMENINO
3 4482524 CICLISTA 47.0 MASCULINO
4 4486110 PEATON 33.0 FEMENINO
So, I want to create another dataframe grouped by the 'CODIGO_SINIESTRO' column, with the following columns as the result:
'CODIGO_SINIESTRO': Id of the row.
'PROMEDIO_EDAD': The mean of 'EDAD'.
'CANTIDAD_HOMBRES': The count of MASCULINO values in the 'SEXO' column.
'CANTIDAD_MUJERES': The count of FEMENINO values in the 'SEXO' column.
Finally, I want five extra columns named after the five possible values of the 'CONDICION' column; each stores 1 if the value exists in the group or 0 if not.
I wrote the solution below and it works as expected; however, my dataset has many rows (150k+) and the solution is slow (5 minutes). This is my code:
df_final = df.groupby(['CODIGO_SINIESTRO']).agg(
    CANTIDAD_HOMBRES=pd.NamedAgg(column='SEXO', aggfunc=lambda x: (x=='MASCULINO').sum()),
    CANTIDAD_MUJERES=pd.NamedAgg(column='SEXO', aggfunc=lambda x: (x=='FEMENINO').sum()),
    PROMEDIO_EDAD=pd.NamedAgg(column='EDAD', aggfunc=np.mean),
    MOTOCICLISTA=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='MOTOCICLISTA').any().astype(int)),
    CONDUCTOR=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='CONDUCTOR').any().astype(int)),
    PEATON=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='PEATON').any().astype(int)),
    CICLISTA=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='CICLISTA').any().astype(int)),
    PASAJERO=pd.NamedAgg(column='CONDICION', aggfunc=lambda x: (x=='PASAJERO').any().astype(int))
).reset_index()
Output:
CODIGO_SINIESTRO CANTIDAD_HOMBRES CANTIDAD_MUJERES PROMEDIO_EDAD ...
0 4482524 1 1 36.5
1 4486110 0 1 33.0
2 10476434 1 1 45.5
... MOTOCICLISTA CONDUCTOR PEATON CICLISTA PASAJERO
1 0 0 1 0
0 0 1 0 0
0 1 0 0 1
How can I optimize this solution? Are there other ways of solving this?
Thank you.
Pre-aggregating with vectorized methods should be much more efficient (it turns out it was 100x faster):
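import numpy as np
# Per-group mean age, broadcast back to every row
df['PROMEDIO_EDAD'] = df.groupby('CODIGO_SINIESTRO')['EDAD'].transform('mean')
# 0/1 indicator columns that can simply be summed per group
df['CANTIDAD_HOMBRES'] = np.where(df['SEXO'] == 'MASCULINO', 1, 0)
df['CANTIDAD_MUJERES'] = np.where(df['SEXO'] == 'FEMENINO', 1, 0)
for col in df['CONDICION'].unique():
    df[col] = np.where(df['CONDICION'] == col, 1, 0)
# numeric_only=True skips the text columns (SEXO, CONDICION) when summing
df = df.groupby(['CODIGO_SINIESTRO', 'PROMEDIO_EDAD']).sum(numeric_only=True).reset_index().drop('EDAD', axis=1)
# Clip only the CONDICION flags (column 4 onward) to 0/1; the gender counts stay as counts
df.iloc[:, 4:] = (df.iloc[:, 4:] > 0).astype(int)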
df
Out[1]:
CODIGO_SINIESTRO PROMEDIO_EDAD CANTIDAD_HOMBRES CANTIDAD_MUJERES \
0 4482524 36.5 1 1
1 4486110 33.0 0 1
2 10476434 45.5 1 1
PASAJERO CONDUCTOR MOTOCICLISTA CICLISTA PEATON
0 0 0 1 1 0
1 0 0 0 0 1
2 1 1 0 0 0
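Another vectorized sketch, not part of the answer above, builds the three summary columns with plain groupby reductions and the CONDICION flags with pd.crosstab. It assumes the sample data from the question; the names summary and flags are only illustrative.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "CODIGO_SINIESTRO": [10476434, 10476434, 4482524, 4482524, 4486110],
    "CONDICION": ["PASAJERO", "CONDUCTOR", "MOTOCICLISTA", "CICLISTA", "PEATON"],
    "EDAD": [62.0, 29.0, 26.0, 47.0, 33.0],
    "SEXO": ["MASCULINO", "FEMENINO", "FEMENINO", "MASCULINO", "FEMENINO"],
})
key = df["CODIGO_SINIESTRO"]

# Mean age and gender counts per CODIGO_SINIESTRO, without Python-level lambdas
summary = pd.DataFrame({
    "PROMEDIO_EDAD": df.groupby(key)["EDAD"].mean(),
    "CANTIDAD_HOMBRES": df["SEXO"].eq("MASCULINO").groupby(key).sum(),
    "CANTIDAD_MUJERES": df["SEXO"].eq("FEMENINO").groupby(key).sum(),
})

# One column per CONDICION value: 1 if it occurs in the group, 0 otherwise
flags = pd.crosstab(key, df["CONDICION"]).gt(0).astype(int)

df_final = summary.join(flags).reset_index()
print(df_final)
```

Both pieces share the CODIGO_SINIESTRO index, so the join aligns them row by row.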
Is there a shorter way of dropping a column MultiIndex level (in my case, basic_amt) except transposing it twice?
In [704]: test
Out[704]:
basic_amt
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
In [705]: test.reset_index(level=0, drop=True)
Out[705]:
basic_amt
Faculty NSW QLD VIC All
0 1 1 2 4
1 0 1 0 1
2 1 0 2 3
In [711]: test.transpose().reset_index(level=0, drop=True).transpose()
Out[711]:
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
Another solution is to use MultiIndex.droplevel with rename_axis (new in pandas 0.18.0):
import pandas as pd
cols = pd.MultiIndex.from_arrays([['basic_amt']*4,
                                  ['NSW','QLD','VIC','All']],
                                 names = [None, 'Faculty'])
idx = pd.Index(['All', 'Full Time', 'Part Time'])
df = pd.DataFrame([(1,1,2,4),
                   (0,1,0,1),
                   (1,0,2,3)], index = idx, columns=cols)
print (df)
basic_amt
Faculty NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
df.columns = df.columns.droplevel(0)
#pandas 0.18.0 and higher
df = df.rename_axis(None, axis=1)
#pandas below 0.18.0
#df.columns.name = None
print (df)
NSW QLD VIC All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
print (df.columns)
Index(['NSW', 'QLD', 'VIC', 'All'], dtype='object')
If you need both column names, use list comprehension:
df.columns = ['_'.join(col) for col in df.columns]
print (df)
basic_amt_NSW basic_amt_QLD basic_amt_VIC basic_amt_All
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
print (df.columns)
Index(['basic_amt_NSW', 'basic_amt_QLD', 'basic_amt_VIC', 'basic_amt_All'], dtype='object')
Zip levels together
Here is an alternative solution which zips the levels together and joins them with underscore.
It is derived from the answer above and is what I was looking for when I found this question. I thought I would share it even though it does not answer the exact question asked.
["_".join(pair) for pair in df.columns]
gives
['basic_amt_NSW', 'basic_amt_QLD', 'basic_amt_VIC', 'basic_amt_All']
Just set this as the columns:
df.columns = ["_".join(pair) for pair in df.columns]
basic_amt_NSW basic_amt_QLD basic_amt_VIC basic_amt_All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
How about simply reassigning df.columns (in pandas 0.24 and later, MultiIndex.labels is called MultiIndex.codes):
levels = df.columns.levels
codes = df.columns.codes
df.columns = levels[1][codes[1]]
For example:
import pandas as pd
columns = pd.MultiIndex.from_arrays([['basic_amt']*4,
['NSW','QLD','VIC','All']])
index = pd.Index(['All', 'Full Time', 'Part Time'], name = 'Faculty')
df = pd.DataFrame([(1,1,2,4),
                   (0,1,0,1),
                   (1,0,2,3)])
df.columns = columns
df.index = index
Before:
print(df)
basic_amt
NSW QLD VIC All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
After:
levels = df.columns.levels
codes = df.columns.codes  # named labels before pandas 0.24
df.columns = levels[1][codes[1]]
print(df)
NSW QLD VIC All
Faculty
All 1 1 2 4
Full Time 0 1 0 1
Part Time 1 0 2 3
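On pandas 0.24 or later, a shorter route that none of the answers above uses is DataFrame.droplevel, which drops a level from the column MultiIndex in one call. This is a minimal sketch assuming the same sample frame; the name flat is only illustrative.

```python
import pandas as pd

columns = pd.MultiIndex.from_arrays([["basic_amt"] * 4,
                                     ["NSW", "QLD", "VIC", "All"]])
index = pd.Index(["All", "Full Time", "Part Time"], name="Faculty")
df = pd.DataFrame([(1, 1, 2, 4),
                   (0, 1, 0, 1),
                   (1, 0, 2, 3)], index=index, columns=columns)

# Drop the outer column level (basic_amt); rows and values are untouched
flat = df.droplevel(0, axis=1)
print(flat)
```

Unlike assigning to df.columns, this returns a new DataFrame and leaves df unchanged.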