Specifying column order following groupby aggregation - python

The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')

I think you can select a subset of columns in the desired order:
df_out = (df.groupby(df.index_col)
            .agg({'age': np.mean, 'height': np.sum, 'weight': np.sum})[['age', 'height', 'weight']])
You can also use the pandas string aliases and built-in functions:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age', 'height', 'weight']])
Sample:
df = pd.DataFrame({'name': ['q','q','a','a'],
                   'address': ['a','a','s','s'],
                   'age': [7,8,9,10],
                   'height': [1,3,5,7],
                   'weight': [5,3,6,8]})
print (df)
  address  age  height name  weight
0       a    7       1    q       5
1       a    8       3    q       3
2       s    9       5    a       6
3       s   10       7    a       8
df.index_col = ['name', 'address']
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age', 'height', 'weight']])
print (df_out)
              age  height  weight
name address
a    s        9.5      12      14
q    a        7.5       4       8
EDIT (by suggestion): add reset_index to get name and address back as ordinary columns; as_index=False does not work here if you also need the index values:
df_out = (df.groupby(df.index_col)
            .agg({'age': 'mean', 'height': sum, 'weight': sum})[['age', 'height', 'weight']]
            .reset_index())
print (df_out)
  name address  age  height  weight
0    a       s  9.5      12      14
1    q       a  7.5       4       8
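If the shifting order comes from the dict passed to agg (dict ordering was not guaranteed on older Python/pandas), named aggregation on pandas 0.25 or newer pins the output columns to the keyword order. A minimal sketch, assuming the same df and group keys as above:
# a sketch with named aggregation (requires pandas >= 0.25);
# the output columns follow the keyword order: age, height, weight
df_out = (df.groupby(['name', 'address'])
            .agg(age=('age', 'mean'),
                 height=('height', 'sum'),
                 weight=('weight', 'sum'))
            .reset_index())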

If you care mostly about the order when written to a file, and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
...     {'age': [28, 63, 28, 45],
...      'height': [183, 156, 170, 201],
...      'weight': [70.2, 62.5, 65.9, 81.0],
...      'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
...     columns=['name', 'age', 'weight', 'height'])
>>> df
    name  age  weight  height
0    Kim   28    70.2     183
1    Pat   63    62.5     156
2    Yuu   28    65.9     170
3  Sacha   45    81.0     201
>>> df_out = df.groupby(['age'], as_index=False).agg(
...     {'weight': sum, 'height': sum})
>>> df_out
   age  height  weight
0   28     353   136.1
1   45     201    81.0
2   63     156    62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5
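If the long float (136.10000000000002) in the written file is a problem, to_csv also accepts float_format, and index=False drops the leading index column; a small variation on the call above:
# optional: fix float formatting and drop the index column in the output file
df_out.to_csv('out.csv', sep=',', columns=['age', 'height', 'weight'],
              float_format='%.1f', index=False)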

Pandas: index-derived column with specific increments based on other columns

I have the following data frame:
import pandas as pd

pandas_df = pd.DataFrame([
    ["SEX", "Male"],
    ["SEX", "Female"],
    ["EXACT_AGE", None],
    ["Country", "Afghanistan"],
    ["Country", "Albania"]],
    columns=['FullName', 'ResponseLabel'])
Now what I need to do is add a sort order to this dataframe. Each new "FullName" should increment it by 100, and each consecutive "ResponseLabel" within a given "FullName" should increment it by 1 (for that specific "FullName"). So I basically create two different sort orders that I sum later on.
pandas_full_name_increment = pandas_df[['FullName']].drop_duplicates()
pandas_full_name_increment = pandas_full_name_increment.reset_index()
pandas_full_name_increment.index += 1
pandas_full_name_increment['SortOrderFullName'] = pandas_full_name_increment.index * 100
pandas_df['SortOrderResponseLabel'] = pandas_df.groupby(['FullName']).cumcount() + 1
pandas_df = pd.merge(pandas_df, pandas_full_name_increment, on = ['FullName'], how = 'left')
Result:
    FullName ResponseLabel  SortOrderResponseLabel  index  SortOrderFullName  SortOrder
0        SEX          Male                       1      0                100        101
1        SEX        Female                       2      0                100        102
2  EXACT_AGE          NULL                       1      2                200        201
3    Country   Afghanistan                       1      3                300        301
4    Country       Albania                       2      3                300        302
The result that I get on my "SortOrder" column is correct but I wonder if there is some better approach pandas-wise?
Thank you!
The best way to do this would be to use ngroup and cumcount:
name_group = pandas_df.groupby('FullName')
pandas_df['sort_order'] = (
    name_group.ngroup(ascending=False).add(1).mul(100) +
    name_group.cumcount().add(1)
)
Output
    FullName ResponseLabel  sort_order
0        SEX          Male         101
1        SEX        Female         102
2  EXACT_AGE          None         201
3    Country   Afghanistan         301
4    Country       Albania         302
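Note that ngroup(ascending=False) numbers the groups in reverse of the sorted key order, which happens to match the order of appearance here (SEX, EXACT_AGE, Country). If the 100-step should always follow the order in which each FullName first appears, a variant sketch using sort=False:
# a sketch: number groups by first appearance instead of (reverse) sorted key order
name_group = pandas_df.groupby('FullName', sort=False)
pandas_df['sort_order'] = (
    name_group.ngroup().add(1).mul(100) +
    name_group.cumcount().add(1)
)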

How to create a dataframe with simulated data in python

I have a sample schema which consists of 12 columns, and each column has a certain set of categories. Now I need to simulate that data into a dataframe of around 1000 rows. How do I go about it?
I have used the code below to generate data for each column:
import random

Location = ['USA','India','Prague','Berlin','Dubai','Indonesia','Vienna']
Location = random.choice(Location)
Age = ['Under 18','Between 18 and 64','65 and older']
Age = random.choice(Age)
Gender = ['Female','Male','Other']
Gender = random.choice(Gender)
and so on
I need the output as below:
Location   Age            Gender
Dubai      below 18       Female
India      65 and older   Male
.
.
.
.
You can create each column one by one using np.random.choice:
import numpy as np
import pandas as pd

df = pd.DataFrame()
N = 1000
df["Location"] = np.random.choice(Location, size=N)
df["Age"] = np.random.choice(Age, size=N)
df["Gender"] = np.random.choice(Gender, size=N)
Or do that using a list comprehension:
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
df = pd.DataFrame(
    [np.random.choice(column_to_choice[c], N) for c in column_to_choice]
).T
df.columns = list(column_to_choice.keys())
Result:
>>> print(df.head())
    Location                Age  Gender
0      India       65 and older  Female
1     Berlin  Between 18 and 64  Female
2        USA  Between 18 and 64    Male
3  Indonesia           Under 18    Male
4      Dubai           Under 18   Other
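A compact alternative, assuming the same column_to_choice mapping and N as above, is a dict comprehension inside the DataFrame constructor:
# a sketch: build every column in a single constructor call
df = pd.DataFrame({name: np.random.choice(values, size=N)
                   for name, values in column_to_choice.items()})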
You can use a for loop over the number of rows you want in your dataframe to build a list of dictionaries, then use that list of dictionaries to create the dataframe.
In [15]: list2 = []
In [16]: for i in range(5):
    ...:     loc = random.choice(Location)
    ...:     age = random.choice(Age)
    ...:     gen = random.choice(Gender)
    ...:     k = {'Location': loc, 'Age': age, 'Gender': gen}
    ...:     list2.append(k)
    ...:
In [17]: import pandas as pd
In [18]: df = pd.DataFrame(list2)
In [19]: df
Out[19]:
                 Age Gender   Location
0  Between 18 and 64  Other     Berlin
1       65 and older  Other        USA
2       65 and older   Male      Dubai
3  Between 18 and 64   Male      Dubai
4  Between 18 and 64   Male  Indonesia
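Whichever approach you use, if the simulated data needs to be reproducible from run to run, seed the generators before sampling:
import random
import numpy as np

random.seed(42)     # seeds random.choice
np.random.seed(42)  # seeds np.random.choice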

Grouping column values together

I have a dataframe like so:
Class  price  demand
    1     22       8
    1     60       7
    3     32      14
    2     72       9
    4     45      20
    5     42      25
What I'd like to do is group classes 1-3 in one category and classes 4-5 in one category. Then I'd like to get the sum of price for each category and the sum of demand for each category. I'd like to also get the mean. The result should look something like this:
Class  TotalPrice  TotalDemand  AveragePrice  AverageDemand
    P         186           38          46.5            9.5
    E          87           45          43.5           22.5
Where P is classes 1-3 and E is classes 4-5. How can I group by categories in pandas? Is there a way to do this?
In [8]: df.groupby(np.where(df['Class'].isin([1, 2, 3]), 'P', 'E'))[['price', 'demand']].agg(['sum', 'mean'])
Out[8]:
  price        demand
    sum  mean     sum  mean
E    87  43.5      45  22.5
P   186  46.5      38   9.5
You can create a dictionary that defines your groups.
mapping = {**dict.fromkeys([1, 2, 3], 'P'), **dict.fromkeys([4, 5], 'E')}
Then, if you pass a dictionary or callable to groupby, it automatically gets mapped onto the index. So let's set the index to Class:
d = df.set_index('Class').groupby(mapping).agg(['sum', 'mean']).sort_index(axis=1, level=1)
Finally, we do some tweaking to get column names the way you specified.
rename_dict = {'sum': 'Total', 'mean': 'Average'}
d.columns = d.columns.map(lambda c: f"{rename_dict[c[1]]}{c[0].title()}")
d.rename_axis('Class').reset_index()
  Class  TotalPrice  TotalDemand  AveragePrice  AverageDemand
0     E          87           45          43.5           22.5
1     P         186           38          46.5            9.5
In general, you can form arbitrary bins to group your data using pd.cut, specifying the right bin edges:
import pandas as pd

pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E'])
#0    P
#1    P
#2    P
#3    P
#4    E
#5    E

df2 = (df.groupby(pd.cut(df.Class, bins=[0, 3, 5], labels=['P', 'E']))[['demand', 'price']]
         .agg(['sum', 'mean']).reset_index())
# Get rid of the multi-level columns
df2.columns = [f'{i}_{j}' if j != '' else f'{i}' for i, j in df2.columns]
Output:
  Class  demand_sum  demand_mean  price_sum  price_mean
0     P          38          9.5        186        46.5
1     E          45         22.5         87        43.5
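To get exactly the column names asked for (TotalPrice, TotalDemand, AveragePrice, AverageDemand) without a separate renaming step, named aggregation (pandas >= 0.25) can be combined with either labelling approach above; a sketch, assuming the same df:
import numpy as np
import pandas as pd

# a sketch: label classes 1-3 as 'P' and 4-5 as 'E', then use named aggregation
labels = np.where(df['Class'].isin([1, 2, 3]), 'P', 'E')
out = (df.groupby(labels)
         .agg(TotalPrice=('price', 'sum'),
              TotalDemand=('demand', 'sum'),
              AveragePrice=('price', 'mean'),
              AverageDemand=('demand', 'mean'))
         .rename_axis('Class')
         .reset_index())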

Python : Align the output of dataframe to centre of screen/console

I have written the function below in Python:
def proc_summ(df, var_names_in, var_names_group):
    df['Freq'] = 1
    df_summed = pd.pivot_table(df, index=var_names_group,
                               values=var_names_in,
                               aggfunc=[np.sum], fill_value=0,
                               margins=True, margins_name='Total').reset_index()
    df_summed.columns = df_summed.columns.map(''.join)
    df_summed.columns = [x.strip().replace('sum', '') for x in df_summed.columns]
    string_repr = df_summed.to_string(index=False, justify='center').splitlines()
    string_repr.insert(1, "-" * len(string_repr[0]))
    string_repr.insert(len(df_summed.index) + 1, "-" * len(string_repr[0]))
    out = '\n'.join(string_repr)
    print(out)
And below is the code I am using to call the function:
proc_summ(
    df,
    var_names_in=["Freq", "sal"],
    var_names_group=["name", "age"])
and below is the output:
 name  age  Freq  sal
---------------------
 Arik   32     1  100
David   44     2  260
 John   33     1  200
 John   34     1  300
Peter   33     1  100
---------------------
Total          6  960
Please let me know how I can print the data to the center of the screen, like:
                               name  age  Freq  sal
                              ---------------------
                               Arik   32     1  100
                              David   44     2  260
                               John   33     1  200
                               John   34     1  300
                              Peter   33     1  100
                              ---------------------
                              Total          6  960
If you are using Python 3 you can try something like this:
import shutil
columns = shutil.get_terminal_size().columns
print("hello world".center(columns))
As you are using a DataFrame, you can try something like this:
import shutil
import pandas as pd

data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)

# convert the DataFrame to a string and split it into lines
df_string = df.to_string()
df_split = df_string.split('\n')

columns = shutil.get_terminal_size().columns
for line in df_split:
    print(line.center(columns))
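Applied to proc_summ above, the same idea can live in a small helper that centers every line of the prepared string; a minimal sketch, where print_centered is a hypothetical helper name:
import shutil

def print_centered(text):
    """Center every line of a multi-line string on the terminal."""
    width = shutil.get_terminal_size().columns
    for line in text.splitlines():
        print(line.center(width))

# inside proc_summ, the final print(out) would become:
# print_centered(out)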

Split column values based on flag in python pandas

I have a table as below:
Sex  weight  height
  M      34    5'6"
  F      29    5'1"
  M      29    4'5"
  F      26    5'2"
And I want to display the table as below through python pandas:
      M               F
Height  Weight  Height  Weight
  5'6"      34    5'1"      29
  4'5"      29    5'2"      26
so that I can compare male and female height and weight data side by side.
Ugly but it works. The idea is to split the original DataFrame in two by sex and to recombine them with a hierarchical column index.
# Test data
df = pd.DataFrame({'Sex': ['M','F','M','F'], 'Weight': [34,29,29,26], 'Height': [5.6,5.1,4.5,5.2]})

def reshape(grouped, group):
    df = grouped.get_group(group).loc[:, ['Height', 'Weight']]
    df.columns = [[group, group], df.columns]
    return df.reset_index(drop=True)

grouped = df.groupby('Sex')
pd.concat([reshape(grouped, 'M'), reshape(grouped, 'F')], axis=1)
       M              F
  Height  Weight  Height  Weight
0    5.6      34     5.1      29
1    4.5      29     5.2      26
You can avoid defining a function with this:
import pandas as pd

df = pd.DataFrame({'Sex': ['M','F','M','F'], 'Weight': [34,29,29,26], 'Height': [5.6,5.1,4.5,5.2]})
gr = df.groupby('Sex')
grs = [grs for name, grs in gr]
for each in grs:
    del each['Sex']
    each.index = range(len(each.index))
mI = pd.MultiIndex.from_product([gr.groups.keys(), grs[0].columns])
results = pd.concat(grs, axis=1)
results.columns = mI
print(results)
Which prints:
       M              F
  Height  Weight  Height  Weight
0    5.1      29     5.6      34
1    5.2      26     4.5      29
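A third option, sketched under the assumption that rows should be paired by their order of appearance within each Sex (and a reasonably recent pandas), numbers the rows with cumcount and pivots to a wide layout:
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'M', 'F'],
                   'Weight': [34, 29, 29, 26],
                   'Height': [5.6, 5.1, 4.5, 5.2]})

# number rows within each Sex, then pivot so Sex becomes the top column level
df['row'] = df.groupby('Sex').cumcount()
wide = df.pivot(index='row', columns='Sex', values=['Height', 'Weight'])
wide = wide.swaplevel(axis=1).sort_index(axis=1)
print(wide)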
