Split column values based on flag in python pandas - python

I have a table like this:
Sex  Weight  Height
M    34      5'6"
F    29      5'1"
M    29      4'5"
F    26      5'2"
I want to display it as below using pandas:
M               F
Height  Weight  Height  Weight
5'6"    34      5'1"    29
4'5"    29      5'2"    26
so that I can compare the male and female height and weight data side by side.

Ugly but it works. The idea is to split the original DataFrame in two by sex and to recombine them with a hierarchical column index.
import pandas as pd
# Test data
df = pd.DataFrame({'Sex': ['M','F','M','F'], 'Weight': [34,29,29,26], 'Height': [5.6,5.1,4.5,5.2]})
def reshape(grouped, group):
    # Take one sex, keep the measurement columns, and label them with the group name
    df = grouped.get_group(group).loc[:, ['Height','Weight']]
    df.columns = [[group, group], df.columns]
    return df.reset_index(drop=True)
grouped = df.groupby('Sex')
pd.concat([reshape(grouped,'M'), reshape(grouped,'F')], axis=1)
       M             F
  Height Weight Height Weight
0    5.6     34    5.1     29
1    4.5     29    5.2     26
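The same split-and-recombine idea can probably be written more compactly by handing pd.concat a dictionary, whose keys become the outer column level. This is only a sketch on the test data above; the names parts and wide are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'M', 'F'],
                   'Weight': [34, 29, 29, 26],
                   'Height': [5.6, 5.1, 4.5, 5.2]})

# One frame per sex, each renumbered from 0 so the rows line up side by side
parts = {sex: g[['Height', 'Weight']].reset_index(drop=True)
         for sex, g in df.groupby('Sex')}

# The dictionary keys ('F', 'M') become the top level of the column index
wide = pd.concat(parts, axis=1)
print(wide)
```

Because groupby sorts its keys, F ends up before M here.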

You can avoid defining a function with this:
import pandas as pd
df = pd.DataFrame({'Sex': ['M','F','M','F'], 'Weight': [34,29,29,26], 'Height': [5.6,5.1,4.5,5.2]})
gr = df.groupby('Sex')
# Collect the group names and frames in the same iteration order so the labels match the data
names = [name for name, _ in gr]
grs = [g[['Height', 'Weight']].reset_index(drop=True) for _, g in gr]
mI = pd.MultiIndex.from_product([names, grs[0].columns])
results = pd.concat(grs, axis=1)
results.columns = mI
print(results)
Which prints (groupby sorts the keys, so F comes before M):
       F             M
  Height Weight Height Weight
0    5.1     29    5.6     34
1    5.2     26    4.5     29
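Another way to get the side-by-side layout, offered as a sketch rather than as either answerer's method, is to number the rows within each sex and pivot on that number. The helper column name row is an assumption introduced here.

```python
import pandas as pd

df = pd.DataFrame({'Sex': ['M', 'F', 'M', 'F'],
                   'Weight': [34, 29, 29, 26],
                   'Height': [5.6, 5.1, 4.5, 5.2]})

# 'row' (hypothetical helper column) counts 0, 1, ... separately within each sex
numbered = df.assign(row=df.groupby('Sex').cumcount())

# Pivot gives columns (measurement, sex); swap the levels so sex is on top
wide = (numbered.pivot(index='row', columns='Sex', values=['Height', 'Weight'])
                .swaplevel(axis=1)
                .sort_index(axis=1))
print(wide)
```

Pivoting mixed integer and float columns together may turn the weights into floats, which is harmless for a visual comparison.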

Related

Use lambda with pandas to calculate a new column conditional on existing column

I need to create a new column in a pandas DataFrame which is calculated as the ratio of 2 existing columns in the DataFrame. However, the denominator in the ratio calculation will change based on the value of a string which is found in another column in the DataFrame.
Example. Sample dataset :
import pandas as pd
df = pd.DataFrame(data={'hand' : ['left','left','both','both'],
'exp_force' : [25,28,82,84],
'left_max' : [38,38,38,38],
'both_max' : [90,90,90,90]})
I need to create a new DataFrame column df['ratio'] based on the condition of df['hand'].
If df['hand']=='left' then df['ratio'] = df['exp_force'] / df['left_max']
If df['hand']=='both' then df['ratio'] = df['exp_force'] / df['both_max']
You can use np.where():
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'hand' : ['left','left','both','both'],
                        'exp_force' : [25,28,82,84],
                        'left_max' : [38,38,38,38],
                        'both_max' : [90,90,90,90]})
df['ratio'] = np.where((df['hand']=='left'), df['exp_force'] / df['left_max'], df['exp_force'] / df['both_max'])
df
Out[42]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
Alternatively, if you have many conditions and results, you can use np.select() so that you don't have to keep nesting np.where() calls. Rows that match none of the conditions get the default value, which is 0 unless you pass default:
import numpy as np
import pandas as pd
df = pd.DataFrame(data={'hand' : ['left','left','both','both'],
'exp_force' : [25,28,82,84],
'left_max' : [38,38,38,38],
'both_max' : [90,90,90,90]})
c1 = (df['hand']=='left')
c2 = (df['hand']=='both')
r1 = df['exp_force'] / df['left_max']
r2 = df['exp_force'] / df['both_max']
conditions = [c1,c2]
results = [r1,r2]
df['ratio'] = np.select(conditions,results)
df
Out[430]:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
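One thing to watch with np.select() is what happens to rows that match no condition. The sketch below adds a hypothetical 'right' row, which is not in the question's data, and passes default=np.nan so that unmatched rows show up as missing instead of 0.

```python
import numpy as np
import pandas as pd

# The 'right' row is made up to show the unmatched case
df = pd.DataFrame({'hand': ['left', 'both', 'right'],
                   'exp_force': [25, 82, 30],
                   'left_max': [38, 38, 38],
                   'both_max': [90, 90, 90]})

conditions = [df['hand'] == 'left', df['hand'] == 'both']
results = [df['exp_force'] / df['left_max'], df['exp_force'] / df['both_max']]

# Rows matching neither condition receive NaN rather than the default 0
df['ratio'] = np.select(conditions, results, default=np.nan)
print(df)
```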
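You can also loop over the rows with enumerate(). This assumes the default 0..n-1 index, because df.at looks rows up by label:
for i, e in enumerate(df['hand']):
    if e == 'left':
        df.at[i,'ratio'] = df.at[i,'exp_force'] / df.at[i,'left_max']
    elif e == 'both':
        df.at[i,'ratio'] = df.at[i,'exp_force'] / df.at[i,'both_max']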
df
Output:
hand exp_force left_max both_max ratio
0 left 25 38 90 0.657895
1 left 28 38 90 0.736842
2 both 82 38 90 0.911111
3 both 84 38 90 0.933333
You can use the apply() method of your dataframe :
df['ratio'] = df.apply(
    lambda x: x['exp_force'] / x['left_max'] if x['hand']=='left' else x['exp_force'] / x['both_max'],
    axis=1
)
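To see that the row-wise apply() and a vectorized np.where() give the same numbers, a quick comparison on the question's data might look like this. The variable names row_wise, denominator, vectorized and same are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'hand': ['left', 'left', 'both', 'both'],
                   'exp_force': [25, 28, 82, 84],
                   'left_max': [38, 38, 38, 38],
                   'both_max': [90, 90, 90, 90]})

# Row-wise version: one Python function call per row
row_wise = df.apply(
    lambda x: x['exp_force'] / x['left_max'] if x['hand'] == 'left'
    else x['exp_force'] / x['both_max'],
    axis=1)

# Vectorized version: choose the denominator per row, then divide once
denominator = np.where(df['hand'] == 'left', df['left_max'], df['both_max'])
vectorized = df['exp_force'] / denominator

same = np.allclose(row_wise, vectorized)
print(same)
```

On large frames the vectorized form is usually the faster of the two, since apply() with axis=1 runs the lambda once per row.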

How to create a dataframe with simulated data in python

I have a sample schema that consists of 12 columns, and each column has a fixed set of categories. I need to simulate this data in a DataFrame of around 1000 rows. How do I go about it?
I have used the code below to generate a value for each column:
import random
Location = ['USA','India','Prague','Berlin','Dubai','Indonesia','Vienna']
location = random.choice(Location)
Age = ['Under 18','Between 18 and 64','65 and older']
age = random.choice(Age)
Gender = ['Female','Male','Other']
gender = random.choice(Gender)
and so on.
I need the output as below:
Location  Age           Gender
Dubai     Under 18      Female
India     65 and older  Male
...
You can create each column one by one using np.random.choice:
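import numpy as np
import pandas as pd
# Location, Age and Gender are the category lists from the question
df = pd.DataFrame()
N = 1000
df["Location"] = np.random.choice(Location, size=N)
df["Age"] = np.random.choice(Age, size=N)
df["Gender"] = np.random.choice(Gender, size=N)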
Or build all the columns at once with a list comprehension:
N = 1000
column_to_choice = {"Location": Location, "Age": Age, "Gender": Gender}
df = pd.DataFrame(
    [np.random.choice(column_to_choice[c], N) for c in column_to_choice]
).T
df.columns = list(column_to_choice.keys())
Result:
>>> print(df.head())
Location Age Gender
0 India 65 and older Female
1 Berlin Between 18 and 64 Female
2 USA Between 18 and 64 Male
3 Indonesia Under 18 Male
4 Dubai Under 18 Other
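If the simulated data needs to be reproducible, a seeded NumPy generator can stand in for np.random.choice. The sketch below assumes the three category lists from the question; the seed value 42 and the names rng and categories are arbitrary choices made here.

```python
import numpy as np
import pandas as pd

# Category lists copied from the question
categories = {
    'Location': ['USA', 'India', 'Prague', 'Berlin', 'Dubai', 'Indonesia', 'Vienna'],
    'Age': ['Under 18', 'Between 18 and 64', '65 and older'],
    'Gender': ['Female', 'Male', 'Other'],
}

N = 1000
rng = np.random.default_rng(42)  # arbitrary seed, so reruns give the same rows

# Draw N values per column, uniformly at random from that column's categories
df = pd.DataFrame({col: rng.choice(values, size=N)
                   for col, values in categories.items()})
print(df.head())
```

rng.choice also accepts a p argument with one probability per category, in case the categories should not be equally likely.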
You can loop once for each row you want, build a list of dictionaries, and then create the DataFrame from that list.
In [15]: import random
    ...: list2 = []
In [16]: for i in range(5):
    ...:     loc = random.choice(Location)
    ...:     age = random.choice(Age)
    ...:     gen = random.choice(Gender)
    ...:     k = {'Location':loc,'Age':age, 'Gender':gen}
    ...:     list2.append(k)
    ...:
In [17]: import pandas as pd
In [18]: df = pd.DataFrame(list2)
In [19]: df
Out[19]:
Age Gender Location
0 Between 18 and 64 Other Berlin
1 65 and older Other USA
2 65 and older Male Dubai
3 Between 18 and 64 Male Dubai
4 Between 18 and 64 Male Indonesia

How to get the rolling mean and find the percentage of Male and Female in each occupation?

occupation     gender  number
administrator  F       36
               M       43
artist         F       13
               M       15
doctor         M        7
educator       F       26
               M       69
How do I get the rolling mean of the first two columns and find the average of male (M) and female (F) in each occupation?
import pandas as pd
users = pd.read_table('https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user',
                      sep='|', index_col='user_id')
users.head()
         age gender  occupation zip_code
user_id
1         24      M  technician    85711
2         53      F       other    94043
3         23      M      writer    32067
4         24      M  technician    43537
5         33      F       other    15213
# count the rows for each (occupation, gender) pair
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
# count the rows for each occupation
occup_count = users.groupby(['occupation']).agg('count')
# divide gender_ocup by occup_count and multiply by 100 to get percentages
occup_gender = gender_ocup.div(occup_count, level = "occupation") * 100
# show every row of the 'gender' column
occup_gender.loc[: , 'gender']
courtesy
https://github.com/guipsamora/pandas_exercises/blob/master/03_Grouping/Occupation/Exercises_with_solutions.ipynb
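A shorter route to the same percentages might be value_counts(normalize=True) on the grouped gender column. The sketch below uses a tiny made-up users frame in place of the downloaded file, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up stand-in for the users table; only the two relevant columns
users = pd.DataFrame({
    'occupation': ['artist', 'artist', 'artist', 'artist', 'doctor', 'doctor'],
    'gender':     ['F',      'M',      'M',      'F',      'M',      'M'],
})

# Share of each gender within each occupation, expressed as a percentage
pct = users.groupby('occupation')['gender'].value_counts(normalize=True) * 100
print(pct)
```

The result is a Series indexed by (occupation, gender), so a single value can be looked up with pct.loc[('artist', 'F')].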

Python : Align the output of dataframe to centre of screen/console

I have written the function below in Python:
import numpy as np
import pandas as pd
def proc_summ(df, var_names_in, var_names_group):
    df['Freq'] = 1
    df_summed = pd.pivot_table(df, index=(var_names_group),
                               values=(var_names_in),
                               aggfunc=[np.sum], fill_value=0, margins=True, margins_name='Total').reset_index()
    df_summed.columns = df_summed.columns.map(''.join)
    df_summed.columns = [x.strip().replace('sum', '') for x in df_summed.columns]
    string_repr = df_summed.to_string(index=False, justify='center').splitlines()
    string_repr.insert(1, "-" * len(string_repr[0]))
    string_repr.insert(len(df_summed.index)+1, "-" * len(string_repr[0]))
    out = '\n'.join(string_repr)
    print(out)
And below is the code I am using to call the function:
proc_summ (
df,
var_names_in=["Freq","sal"] ,
var_names_group=["name","age"])
and below is the output:
name age Freq sal
--------------------
Arik 32 1 100
David 44 2 260
John 33 1 200
John 34 1 300
Peter 33 1 100
--------------------
Total 6 960
Please let me know how I can print the data in the center of the screen, like this:
                              name age Freq sal
                              --------------------
                              Arik 32 1 100
                              David 44 2 260
                              John 33 1 200
                              John 34 1 300
                              Peter 33 1 100
                              --------------------
                              Total 6 960
If you are using Python3 you can try something like this
import shutil
columns = shutil.get_terminal_size().columns
print("hello world".center(columns))
Since you are using a DataFrame, you can try something like this:
import shutil
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data)
# convert DataFrame to string
df_string = df.to_string()
df_split = df_string.split('\n')
columns = shutil.get_terminal_size().columns
# df_split holds the header line plus one line per row, so loop over all of it
for line in df_split:
    print(line.center(columns))
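Centering every line separately can shift rows against each other when the lines differ in length, as they do once dashed rules are inserted. A possible variation, sketched here with a hypothetical helper named center_block, pads all lines by the same amount based on the longest one.

```python
import shutil

def center_block(text, width=None):
    """Indent every line of text equally so the block sits mid-terminal."""
    lines = text.splitlines()
    if width is None:
        # Fall back to the current terminal width
        width = shutil.get_terminal_size().columns
    longest = max(len(line) for line in lines)
    # Same left padding for every line keeps the columns aligned
    pad = max((width - longest) // 2, 0)
    return '\n'.join(' ' * pad + line for line in lines)

centered = center_block("ab\nabcd", width=10)
print(centered)
```

In the question's function, print(center_block(out)) in place of print(out) would be one way to use it.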

Specifying column order following groupby aggregation

The ordering of my age, height and weight columns is changing with each run of the code. I need to keep the order of my agg columns static because I ultimately refer to this output file according to the column locations. What can I do to make sure age, height and weight are output in the same order every time?
d = pd.read_csv(input_file, na_values=[''])
df = pd.DataFrame(d)
df.index_col = ['name', 'address']
df_out = df.groupby(df.index_col).agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})
df_out.to_csv(output_file, sep=',')
I think you can select the columns in the order you need after aggregating:
df_out = (df.groupby(df.index_col)
            .agg({'age':np.mean, 'height':np.sum, 'weight':np.sum})[['age','height','weight']])
You can also pass the aggregation names as strings:
df_out = (df.groupby(df.index_col)
            .agg({'age':'mean', 'height':'sum', 'weight':'sum'})[['age','height','weight']])
Sample:
df = pd.DataFrame({'name':['q','q','a','a'],
'address':['a','a','s','s'],
'age':[7,8,9,10],
'height':[1,3,5,7],
'weight':[5,3,6,8]})
print (df)
address age height name weight
0 a 7 1 q 5
1 a 8 3 q 3
2 s 9 5 a 6
3 s 10 7 a 8
df.index_col = ['name', 'address']
df_out = (df.groupby(df.index_col)
            .agg({'age':'mean', 'height':'sum', 'weight':'sum'})[['age','height','weight']])
print (df_out)
age height weight
name address
a s 9.5 12 14
q a 7.5 4 8
EDIT (by suggestion): add reset_index(). Here as_index=False does not work if you need the index values too:
df_out = (df.groupby(df.index_col)
            .agg({'age':'mean', 'height':'sum', 'weight':'sum'})[['age','height','weight']]
            .reset_index())
print (df_out)
name address age height weight
0 a s 9.5 12 14
1 q a 7.5 4 8
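In pandas 0.25 and later, named aggregation is another option: the output columns follow the order of the keyword arguments, so no reselection is needed afterwards. This is a sketch on the sample data above, not the answerer's code.

```python
import pandas as pd

df = pd.DataFrame({'name': ['q', 'q', 'a', 'a'],
                   'address': ['a', 'a', 's', 's'],
                   'age': [7, 8, 9, 10],
                   'height': [1, 3, 5, 7],
                   'weight': [5, 3, 6, 8]})

# Each keyword is an output column: (source column, aggregation name)
df_out = (df.groupby(['name', 'address'])
            .agg(age=('age', 'mean'),
                 height=('height', 'sum'),
                 weight=('weight', 'sum'))
            .reset_index())
print(df_out)
```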
If you care mostly about the order when the data is written to a file, and not while it's still in a DataFrame object, you can set the columns parameter of the to_csv() method:
>>> df = pd.DataFrame(
{'age': [28,63,28,45],
'height': [183,156,170,201],
'weight': [70.2, 62.5, 65.9, 81.0],
'name': ['Kim', 'Pat', 'Yuu', 'Sacha']},
columns=['name','age','weight', 'height'])
>>> df
name age weight height
0 Kim 28 70.2 183
1 Pat 63 62.5 156
2 Yuu 28 65.9 170
3 Sacha 45 81.0 201
>>> df_out = df.groupby(['age'], as_index=False).agg(
{'weight': sum, 'height': sum})
>>> df_out
age height weight
0 28 353 136.1
1 45 201 81.0
2 63 156 62.5
>>> df_out.to_csv('out.csv', sep=',', columns=['age','height','weight'])
out.csv then looks like this:
,age,height,weight
0,28,353,136.10000000000002
1,45,201,81.0
2,63,156,62.5
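The empty first header field in out.csv is the row index. If the file is later read by column position, it may be safer to leave the index out with index=False. The sketch below writes to an in-memory buffer instead of a real file so that it can run anywhere.

```python
import io
import pandas as pd

df = pd.DataFrame({'name': ['Kim', 'Pat', 'Yuu', 'Sacha'],
                   'age': [28, 63, 28, 45],
                   'weight': [70.2, 62.5, 65.9, 81.0],
                   'height': [183, 156, 170, 201]})

df_out = df.groupby('age', as_index=False).agg({'weight': 'sum', 'height': 'sum'})

# In-memory stand-in for out.csv
buffer = io.StringIO()
# columns fixes the column order; index=False drops the leading index column
df_out.to_csv(buffer, columns=['age', 'height', 'weight'], index=False)
csv_text = buffer.getvalue()
print(csv_text)
```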
