Given the following data frame:
          Name   Telephone  Telephone2
0   Bill Gates  555 77 854  555 77 855
1  Bill Gates2           2           3
How can I get exactly the following output (printing column by column)?
Name
Bill Gates
Bill Gates2
Telephone
555 77 854
2
Telephone2
555 77 855
3
I tried:
for key, val in df.iterrows():
    for sub_val in val:
        print(sub_val)
But I get:
Bill Gates
555 77 854
555 77 855
Bill Gates2
2
3
You can use:
for col in df:
    print(col)
    print('\n'.join(df[col]))
Output:
Name
Bill Gates
Bill Gates2
Telephone
555 77 854
2
Telephone2
555 77 855
3
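Note that '\n'.join(df[col]) relies on every column holding strings, as they do here. If some columns were numeric, a hedged alternative (a sketch, not tied to this exact frame) is to let pandas render each column:

for col in df:
    print(col)
    # Series.to_string handles any dtype and drops the index labels
    print(df[col].to_string(index=False))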
I have a dataset in Python that I am trying to convert from a wide dataset like this:
ID  Name    2007  2008
1   Andy     324   412
2   Becky    123   422
3   Lizzie   332   564
To a long dataset such as this:
ID  Name    Year  Var
1   Andy    2007  324
1   Andy    2008  412
2   Becky   2007  123
2   Becky   2008  422
3   Lizzie  2007  332
3   Lizzie  2008  564
Unfortunately I can't use pivot because there are two identification columns and multiple observations for each year. Any help would be much appreciated.
You can't use pivot here because this is actually a melt operation:
out = (df.melt(id_vars=["ID", "Name"],
               value_vars=["2007", "2008"],
               var_name="Year",
               value_name="Var")
         .sort_values(["ID", "Year"]))
print(out)
   ID    Name  Year  Var
0   1    Andy  2007  324
3   1    Andy  2008  412
1   2   Becky  2007  123
4   2   Becky  2008  422
2   3  Lizzie  2007  332
5   3  Lizzie  2008  564
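If you'd rather not list the year columns explicitly, an equivalent route (a sketch, assuming every non-identifier column is a year) is to stack them:

out = (df.set_index(["ID", "Name"])      # park the identifiers in the index
         .rename_axis(columns="Year")    # name the column axis
         .stack()                        # one row per (ID, Name, Year)
         .reset_index(name="Var"))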
I have a dataframe as below.
I want to group by 'User' & 'eve' and sum 'Ses' in blocks: from the start of each group (or from the previous 100/200 row) up to and including each row where ID is 100 or 200.
Also, return the value of column 'Name' on the row where the 100/200 occurs.
If the rows after a 100/200 are never followed by another 100 or 200 (like the last row in group (a, 123) or (a, 456)), ignore them.
User eve Ses ID Name
a 123 1 10 a
a 123 2 11 a
a 123 3 12 a
a 123 4 13 a
a 123 3 100 xyz
a 123 6 10 a
a 456 1 11 a
a 456 2 12 a
a 456 3 13 a
a 456 4 40 a
a 456 1 100 mno
a 456 14 10 a
a 456 7 20 a
a 456 8 30 a
a 456 12 200 pqr
a 456 10 10 a
b 123 1 20 a
b 123 2 30 a
b 123 3 40 a
b 123 4 50 a
b 123 1 70 a
b 123 6 100 abc
b 888 1 20 a
b 888 1 200 jkl
b 888 3 10 a
b 888 4 20 a
b 888 5 30 a
b 888 1 100 rrr
b 888 7 50 a
b 888 8 70 a
The expected output for the above input df is the df below:
User eve Ses Name
a 123 13 xyz
a 456 11 mno
a 456 41 pqr
b 123 17 abc
b 888 2 jkl
b 888 13 rrr
This is my approach:
# flag the valid IDs (100/200)
df['valids'] = df['ID'].isin([100, 200])

# mask the trailing rows after the last 100/200 in each (User, eve) group
heads = (df['ID'].where(df['valids'])
                 .groupby([df['User'], df['eve']])
                 .bfill()
                 .notnull())
df = df[heads]

# groupby and output: the shifted cumsum gives each block of rows,
# ending at a 100/200 row, its own id
(df.groupby(['User', 'eve', df['valids'].shift(fill_value=0).cumsum()],
            as_index=False)
   .agg({'Ses': 'sum', 'Name': 'last'}))
Output:
  User  eve  Ses Name
0    a  123   13  xyz
1    a  456   11  mno
2    a  456   41  pqr
3    b  123   17  abc
4    b  888    2  jkl
5    b  888   13  rrr
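To see why the shift/cumsum key groups each run of rows together with the 100/200 row that closes it, here is a minimal sketch on a toy boolean series (the values are illustrative only):

import pandas as pd

valids = pd.Series([False, False, True, False, True])  # True where ID is 100/200
# shifting keeps each 100/200 row inside the block it closes;
# cumsum then labels the blocks 0, 1, 2, ...
print(valids.shift(fill_value=0).cumsum().tolist())    # [0, 0, 0, 1, 1]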
I have a dataframe - df as below :
Stud_id Card Nation Gender Age Code Amount yearmonth
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 100 201601
111 1 India M Adult 543 150 201602
111 1 India M Adult 612 100 201602
111 1 India M Adult 715 200 201603
222 2 India M Adult 715 200 201601
222 2 India M Adult 543 100 201604
222 2 India M Adult 543 100 201603
333 3 India M Adult 543 100 201601
333 3 India M Adult 543 100 201601
333 4 India M Adult 543 150 201602
333 4 India M Adult 612 100 201607
Now, I want two dataframes as below :
df_1 :
card Code Total_Amount Avg_Amount
1 543 350 175
2 543 200 100
3 543 200 200
4 543 150 150
1 612 100 100
4 612 100 100
1 715 200 200
2 715 200 200
Logic for df_1:
1. Total_Amount: for each unique Card and Code, the sum of Amount (e.g. Card 1, Code 543 = 350).
2. Avg_Amount: Total_Amount divided by the number of unique yearmonth values for that Card and Code (e.g. Total_Amount = 350, 2 unique yearmonths, so 350/2 = 175).
df_2 :
Code Avg_Amount
543 156.25
612 100
715 200
Logic for df_2:
1. Avg_Amount: the mean of Avg_Amount over each Code in df_1 (e.g. for Code 543 the sum of Avg_Amount is 175 + 100 + 200 + 150 = 625; divided by the number of rows, 4, that gives 625/4 = 156.25).
Code to create the data frame df:
df = pd.DataFrame({'Stud_id': (111,111,111,111,111,222,222,222,333,333,333,333),
                   'Card': (1,1,1,1,1,2,2,2,3,3,4,4),
                   'Nation': ('India',)*12,
                   'Gender': ('M',)*12,
                   'Age': ('Adult',)*12,
                   'Code': (543,543,543,612,715,715,543,543,543,543,543,612),
                   'Amount': (100,100,150,100,200,200,100,100,100,100,150,100),
                   'yearmonth': (201601,201601,201602,201602,201603,201601,201604,201603,201601,201601,201602,201607)})
Code to get the required df_1 and df_2:
df1 = df.groupby(['Card', 'Code'])['yearmonth', 'Amount'].apply(
    lambda x: [sum(x.Amount), sum(x.Amount) / len(set(x.yearmonth))]).apply(
    pd.Series).reset_index()
df1.columns = ['Card', 'Code', 'Total_Amount', 'Avg_Amount']
df2 = df1.groupby('Code')['Avg_Amount'].apply(lambda x: sum(x) / len(x)).reset_index(
    name='Avg_Amount')
Though the code works fine, my dataset is huge and it's taking a long time. I suspect the apply calls are the bottleneck. Is there a more optimized way to write this?
For DataFrame 1 you can do this:
tmp = df.groupby(['Card', 'Code'], as_index=False) \
        .agg({'Amount': 'sum', 'yearmonth': pd.Series.nunique})
df1 = tmp.assign(Avg_Amount=tmp.Amount / tmp.yearmonth) \
         .drop(columns=['yearmonth'])
   Card  Code  Amount  Avg_Amount
0     1   543     350       175.0
1     1   612     100       100.0
2     1   715     200       200.0
3     2   543     200       100.0
4     2   715     200       200.0
5     3   543     200       200.0
6     4   543     150       150.0
7     4   612     100       100.0
For DataFrame 2 you can do this:
df1.groupby('Code', as_index=False) \
   .agg({'Avg_Amount': 'mean'})
   Code  Avg_Amount
0   543      156.25
1   612      100.00
2   715      200.00
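If you are on pandas 0.25 or newer, named aggregation expresses the same computation a bit more directly (a sketch, assuming the same df as above; performance is comparable to the agg-dict version):

df1 = (df.groupby(['Card', 'Code'], as_index=False)
         .agg(Total_Amount=('Amount', 'sum'),
              n_months=('yearmonth', 'nunique')))   # unique yearmonths per group
df1['Avg_Amount'] = df1['Total_Amount'] / df1['n_months']
df1 = df1.drop(columns='n_months')

df2 = df1.groupby('Code', as_index=False).agg(Avg_Amount=('Avg_Amount', 'mean'))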
I'm using Pandas 0.19.
Considering the following data frame:
FID admin0 admin1 admin2 windspeed population
0 cntry1 state1 city1 60km/h 700
1 cntry1 state1 city1 90km/h 210
2 cntry1 state1 city2 60km/h 100
3 cntry1 state2 city3 60km/h 70
4 cntry1 state2 city4 60km/h 180
5 cntry1 state2 city4 90km/h 370
6 cntry2 state3 city5 60km/h 890
7 cntry2 state3 city6 60km/h 120
8 cntry2 state3 city6 90km/h 420
9 cntry2 state3 city6 120km/h 360
10 cntry2 state4 city7 60km/h 740
How can I create a table like this one?
                        population
                        60km/h  90km/h  120km/h
admin0  admin1  admin2
cntry1  state1  city1      700     210        0
cntry1  state1  city2      100       0        0
cntry1  state2  city3       70       0        0
cntry1  state2  city4      180     370        0
cntry2  state3  city5      890       0        0
cntry2  state3  city6      120     420      360
cntry2  state4  city7      740       0        0
I have tried with the following pivot table:
table = pd.pivot_table(df, index=["admin0", "admin1", "admin2"],
                       columns=["windspeed"], values=["population"],
                       fill_value=0)
In general it works great, but unfortunately I am not able to sort the new columns in the right order: the 120km/h column appears before the ones for 60km/h and 90km/h. How can I specify the order of the new columns?
Moreover, as a second step I need to add subtotals both for admin0 and admin1. Ideally, the table I need should be like this:
                        population
                        60km/h  90km/h  120km/h
admin0  admin1  admin2
cntry1  state1  city1      700     210        0
cntry1  state1  city2      100       0        0
SUM     state1             800     210        0
cntry1  state2  city3       70       0        0
cntry1  state2  city4      180     370        0
SUM     state2             250     370        0
SUM     cntry1            1050     580        0
cntry2  state3  city5      890       0        0
cntry2  state3  city6      120     420      360
SUM     state3            1010     420      360
cntry2  state4  city7      740       0        0
SUM     state4             740       0        0
SUM     cntry2            1750     420      360
SUM     ALL               2800    1000      360
You can do it using the reindex() method and custom sorting:
In [26]: table
Out[26]:
                     population
windspeed               120km/h 60km/h 90km/h
admin0 admin1 admin2
cntry1 state1 city1           0    700    210
              city2           0    100      0
       state2 city3           0     70      0
              city4           0    180    370
cntry2 state3 city5           0    890      0
              city6         360    120    420
       state4 city7           0    740      0
In [27]: cols = sorted(table.columns.tolist(), key=lambda x: int(x[1].replace('km/h','')))
In [28]: cols
Out[28]: [('population', '60km/h'), ('population', '90km/h'), ('population', '120km/h')]
In [29]: table = table.reindex(columns=cols)
In [30]: table
Out[30]:
                     population
windspeed                60km/h 90km/h 120km/h
admin0 admin1 admin2
cntry1 state1 city1         700    210       0
              city2         100      0       0
       state2 city3          70      0       0
              city4         180    370       0
cntry2 state3 city5         890      0       0
              city6         120    420     360
       state4 city7         740      0       0
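Another option, if you'd rather fix the order before pivoting, is to make windspeed an ordered categorical; pivot_table then emits the columns in category order (a sketch, assuming exactly these three labels appear in the data):

df['windspeed'] = pd.Categorical(df['windspeed'],
                                 categories=['60km/h', '90km/h', '120km/h'],
                                 ordered=True)
table = pd.pivot_table(df, index=["admin0", "admin1", "admin2"],
                       columns=["windspeed"], values=["population"],
                       fill_value=0)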
Solution with subtotals using MultiIndex.from_arrays: build the subtotal frames, concat all the DataFrames together, sort_index, and finally add the grand total:
#replace km/h and convert to int
df.windspeed = df.windspeed.str.replace('km/h','').astype(int)
print (df)
    FID  admin0  admin1 admin2  windspeed  population
0     0  cntry1  state1  city1         60         700
1     1  cntry1  state1  city1         90         210
2     2  cntry1  state1  city2         60         100
3     3  cntry1  state2  city3         60          70
4     4  cntry1  state2  city4         60         180
5     5  cntry1  state2  city4         90         370
6     6  cntry2  state3  city5         60         890
7     7  cntry2  state3  city6         60         120
8     8  cntry2  state3  city6         90         420
9     9  cntry2  state3  city6        120         360
10   10  cntry2  state4  city7         60         740
#pivoting
table = pd.pivot_table(df,
                       index=["admin0", "admin1", "admin2"],
                       columns=["windspeed"],
                       values=["population"],
                       fill_value=0)
print (table)
                     population
windspeed                     60   90  120
admin0 admin1 admin2
cntry1 state1 city1          700  210    0
              city2          100    0    0
       state2 city3           70    0    0
              city4          180  370    0
cntry2 state3 city5          890    0    0
              city6          120  420  360
       state4 city7          740    0    0
#groupby and create sum dataframe by levels 0,1
df1 = table.groupby(level=[0, 1]).sum()
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1) + '_sum',
                                       len(df1.index) * ['']])
print (df1)
                  population
windspeed                 60   90  120
admin0
cntry1 state1_sum        800  210    0
       state2_sum        250  370    0
cntry2 state3_sum       1010  420  360
       state4_sum        740    0    0
df2 = table.groupby(level=0).sum()
df2.index = pd.MultiIndex.from_arrays([df2.index.values + '_sum',
                                       len(df2.index) * [''],
                                       len(df2.index) * ['']])
print (df2)
           population
windspeed          60   90  120
cntry1_sum       1050  580    0
cntry2_sum       1750  420  360
#concat all dataframes together, sort index
df = pd.concat([table, df1, df2]).sort_index(level=[0])
#add km/h to second level in columns
df.columns = pd.MultiIndex.from_arrays([df.columns.get_level_values(0),
                                        df.columns.get_level_values(1).astype(str) + 'km/h'])
#add all sum
df.loc[('All_sum','','')] = table.sum().values
print (df)
                             population
                                 60km/h 90km/h 120km/h
admin0     admin1     admin2
cntry1     state1     city1         700    210       0
                      city2         100      0       0
           state1_sum               800    210       0
           state2     city3          70      0       0
                      city4         180    370       0
           state2_sum               250    370       0
cntry1_sum                         1050    580       0
cntry2     state3     city5         890      0       0
                      city6         120    420     360
           state3_sum              1010    420     360
           state4     city7         740      0       0
           state4_sum               740      0       0
cntry2_sum                         1750    420     360
All_sum                            2800   1000     360
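A note on why the subtotal rows land in the right place: sort_index works here only because 'state1_sum' sorts lexicographically after 'state1' and before 'state2' (and likewise at the country level); with level values that don't sort this way you would need an explicit ordering.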
EDIT (per a comment): to produce a subtotal only for groups with more than one row, return the sum conditionally; single-row groups then come back as all-NaN rows, which dropna(how='all') removes:
def f(x):
    # emit a subtotal only for groups with more than one row
    if len(x) > 1:
        return x.sum()

df1 = table.groupby(level=[0, 1]).apply(f).dropna(how='all')
df1.index = pd.MultiIndex.from_arrays([df1.index.get_level_values(0),
                                       df1.index.get_level_values(1) + '_sum',
                                       len(df1.index) * ['']])
print (df1)
                  population
windspeed                 60     90    120
admin0
cntry1 state1_sum      800.0  210.0    0.0
       state2_sum      250.0  370.0    0.0
cntry2 state3_sum     1010.0  420.0  360.0