I don't know if this is possible but I have a data frame like this one:
df
State County Homicides Man Woman Not_Register
Gto Celaya 2 2 0 0
NaN NaN 8 4 2 2
NaN NaN 3 2 1 0
NaN Yiriria 2 1 1 0
Nan Acambaro 1 1 0 0
Sin Culiacan 3 1 1 1
NaN Nan 5 4 0 1
Chih Juarez 1 1 0 0
I want to group by State and County and sum Homicides, Man, Woman and Not_Register, like this:
State County Homicides Man Woman Not_Register
Gto Celaya 13 8 3 2
Gto Yiriria 2 1 1 0
Gto Acambaro 1 1 0 0
Sin Culiacan 8 5 1 2
Chih Juarez 1 1 0 0
So far, I have been able to group by State and County and to fill the NaN rows with the correct State and County names. My code and result:
import numpy as np
import math
df = df.ffill()  # Repeat each State and County name down the rows that follow it
# Group by State and County and sum the remaining columns
df = df.groupby(["State", "County"]).agg('sum')
df = df.reset_index()
df
State County Homicides
Gto Celaya 13
Gto Yiriria 2
Gto Acambaro 1
Sin Culiacan 8
Chih Juarez 1
But when I tried to add the Man and Woman columns:
df1 = df.groupby(["State", "County", "Man", "Woman", "Not_Register"]).agg('sum')
df1 = df1.reset_index()
df1
My result repeats the counties instead of giving one row per County within each State.
How can I resolve this issue?
Thanks for your help.
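Convert the value columns to numeric, then group only by State and County (Man, Woman and Not_Register are values to sum, not grouping keys):
cols = ['Homicides', 'Man', 'Woman', 'Not_Register']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.groupby(['State', 'County']).sum().reset_index()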
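For reference, here is a minimal end-to-end sketch on the sample data from the question. It assumes the blank State and County cells are genuine missing values (np.nan) rather than the literal strings 'NaN' or 'Nan'; the names value_cols and result, and the sort=False argument (used to keep the groups in order of first appearance), are my own additions.

```python
import numpy as np
import pandas as pd

# Sample data from the question; missing names are assumed to be real NaN values
df = pd.DataFrame({
    "State": ["Gto", np.nan, np.nan, np.nan, np.nan, "Sin", np.nan, "Chih"],
    "County": ["Celaya", np.nan, np.nan, "Yiriria", "Acambaro",
               "Culiacan", np.nan, "Juarez"],
    "Homicides": [2, 8, 3, 2, 1, 3, 5, 1],
    "Man": [2, 4, 2, 1, 1, 1, 4, 1],
    "Woman": [0, 2, 1, 1, 0, 1, 0, 0],
    "Not_Register": [0, 2, 0, 0, 0, 1, 1, 0],
})

# Forward-fill only the key columns so every row carries its State and County
df[["State", "County"]] = df[["State", "County"]].ffill()

# Make sure the value columns are numeric before summing
value_cols = ["Homicides", "Man", "Woman", "Not_Register"]
df[value_cols] = df[value_cols].apply(pd.to_numeric, errors="coerce")

# Group by the keys only; the value columns are summed within each group
result = (df.groupby(["State", "County"], sort=False)[value_cols]
            .sum()
            .reset_index())
print(result)
```

Grouping by Man, Woman and Not_Register as well would create one group per distinct combination of those values, which is why the counties were repeated in the question.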
I created this dataframe and need to group my data into categories with the same city, number of beds and number of baths, and sort the elements of each group by price in descending order.
Secondly, I need to find the difference between each price and the next-ranked price in the same group.
For example, the result should look like this:
1 bed, 1 bath, Madrid, 10
1 bed, 1 bath, Madrid, 8
1 bed, 1 bath, Madrid, 5
1 bed, 1 bath, Madrid, 1
I should get 2, 3, 4...
I tried some code, but the result seems far from what I expect:
import pandas as pd
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
df
df['gap'] = df.sort_values('price',ascending=False).groupby(['city','beds','baths'])['price'].diff()
print (df)
Many thanks in advance.
I would use pd.to_numeric with errors='coerce' to convert the strings in the price column to numbers (anything that cannot be parsed becomes NaN). I would then calculate the difference while ignoring the rooms whose price is unknown (using DataFrame.dropna). Then I show the result both in the original order and sorted:
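df['price'] = pd.to_numeric(df['price'], errors='coerce')
# Drop only the rows with an unknown price, then diff against the next (cheaper) row in each group
df['difference_price'] = (df.dropna(subset=['price'])
                            .sort_values('price', ascending=False)
                            .groupby(['city', 'beds', 'baths'])['price']
                            .diff(-1))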
or using GroupBy.shift:
df['difference_price'] = df['price'].sub(df.dropna(subset=['price'])
                                           .sort_values('price', ascending=False)
                                           .groupby(['city', 'beds', 'baths'])
                                           .price
                                           .shift(-1))
Display the result:
print(df, '\n'*3, 'Sorted DataFrame: ')
print(df.sort_values(['city', 'beds', 'baths', 'price'], ascending=[True, True, True, False]))
Output:
id city beds baths price difference_price
0 1 paris 1 2 10.0 4.0
1 2 madrid 2 2 8.0 1.0
2 3 madrid 2 2 11.0 3.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
5 6 madrid 2 1 7.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
8 9 madrid 1 4 NaN NaN
9 10 paris 2 1 3.0 NaN
10 11 madrid 2 2 7.0 NaN
11 12 paris 2 3 12.0 NaN
12 13 madrid 2 3 7.0 NaN
13 14 madrid 1 1 3.0 NaN
14 15 paris 1 1 3.0 NaN
15 16 madrid 1 1 4.0 1.0
16 17 paris 1 1 5.0 2.0
Sorted DataFrame:
id city beds baths price difference_price
15 16 madrid 1 1 4.0 1.0
13 14 madrid 1 1 3.0 NaN
8 9 madrid 1 4 NaN NaN
5 6 madrid 2 1 7.0 NaN
2 3 madrid 2 2 11.0 3.0
1 2 madrid 2 2 8.0 1.0
10 11 madrid 2 2 7.0 NaN
12 13 madrid 2 3 7.0 NaN
16 17 paris 1 1 5.0 2.0
14 15 paris 1 1 3.0 NaN
0 1 paris 1 2 10.0 4.0
3 4 paris 1 2 6.0 1.0
4 5 paris 1 2 5.0 NaN
6 7 paris 2 1 7.0 0.0
7 8 paris 2 1 7.0 4.0
9 10 paris 2 1 3.0 NaN
11 12 paris 2 3 12.0 NaN
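As a smaller check of the same idea, the sketch below uses the prices 10, 8, 5, 1 from the question's example for a single Madrid group, plus a hypothetical Paris group of my own, and confirms that diff(-1) on price-sorted groups yields the expected gaps 2, 3, 4. The frame and the names ordered and gap are illustrative, not taken from the original data.

```python
import pandas as pd

# Illustrative data: one madrid group (prices from the question's example)
# and one made-up paris group; one price is deliberately stored as a string
df = pd.DataFrame({
    "city": ["madrid", "madrid", "madrid", "madrid", "paris", "paris"],
    "beds": [1, 1, 1, 1, 1, 1],
    "baths": [1, 1, 1, 1, 1, 1],
    "price": [5, "10", 1, 8, 3, 7],
})

# Coerce prices to numbers; unparseable values would become NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# Sort so that each group is contiguous and ordered by descending price
ordered = df.sort_values(["city", "beds", "baths", "price"],
                         ascending=[True, True, True, False])

# diff(-1) subtracts the next row in the group from the current row,
# so the cheapest row in each group gets NaN
ordered["gap"] = ordered.groupby(["city", "beds", "baths"])["price"].diff(-1)
print(ordered)
```

Because the result keeps the original index, assigning the gap column back to the unsorted frame would align each value with its own row.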
If I understand this part correctly:
group my data into category with the same number of beds, city, baths and sort(descending)
then all rows that do not meet that condition (where beds and baths are different) should be dropped? Here is my code for that interpretation of your problem:
import numpy as np
import pandas as pd
data=[[1,'paris',1,2,'10'],[2,'madrid',2,2,8],[3,'madrid',2,2,11],[4,'paris',1,2,6],[5,'paris',1,2,5],[6,'madrid',2,1,7],[7,'paris',2,1,7],[8,'paris',2,1,7],[9,'madrid',1,4],[10,'paris',2,1,3],[11,'madrid',2,2,7],[12,'paris',2,3,12],[13,'madrid',2,3,7],[14,'madrid',1,1,3],[15,'paris',1,1,3],[16,'madrid',1,1,4],[17,'paris',1,1,5]]
df=pd.DataFrame(data, columns=['id','city','beds','baths','price'])
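# Keep only the rows where beds equals baths
df_new = df[df['beds'] == df['baths']]
df_new = df_new.sort_values(['city', 'price'], ascending=[False, False]).reset_index(drop=True)
# diff() subtracts the previous row's price within each group, which is what the output below shows;
# diff(-1) would instead give the positive gap to the next (cheaper) row
df_new['diff_price'] = df_new.groupby(['city', 'beds', 'baths'])['price'].diff()
print(df_new)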
Output:
id city beds baths price diff_price
0 17 paris 1 1 5 NaN
1 15 paris 1 1 3 -2
2 3 madrid 2 2 11 NaN
3 2 madrid 2 2 8 -3
4 11 madrid 2 2 7 -1
5 16 madrid 1 1 4 NaN
6 14 madrid 1 1 3 -1
I have a master dataset like this:
master = pd.DataFrame({'Channel':['1','1','1','1','1'],'Country':['India','Singapore','Japan','United Kingdom','Austria'],'Product':['X','6','7','X','X']})
and a user table like this:
user = pd.DataFrame({'User':['101','101','102','102','102','103','103','103','103','103'],'Country':['India','Brazil','India','Brazil','Japan','All','Austria','Japan','Singapore','United Kingdom'],'count':['2','1','3','2','1','1','1','1','1','1']})
I want to left join the master table with the user table for each user, like below, one user at a time:
merge_101 = pd.merge(master,user[(user.User=='101')],how='left',on=['Country'])
merge_102 = pd.merge(master,user[(user.User=='102')],how='left',on=['Country'])
merge_103 = pd.merge(master,user[(user.User=='103')],how='left',on=['Country'])
merge_all = pd.concat([merge_101, merge_102,merge_103], ignore_index=True)
How can I iterate over each user? Here I first filter the dataset for each user, create a separate dataset, and concatenate them all at the end.
Is there a better way to do this, such as a for loop or a join?
Thanks
IIUC, you need:
pd.concat([pd.merge(master, user[user.User == x], how='left', on=['Country']) for x in user['User'].unique()], ignore_index=True)
Output:
Channel Country Product User count
0 1 India X 101 2
1 1 Singapore 6 NaN NaN
2 1 Japan 7 NaN NaN
3 1 United Kingdom X NaN NaN
4 1 Austria X NaN NaN
5 1 India X 102 3
6 1 Singapore 6 NaN NaN
7 1 Japan 7 102 1
8 1 United Kingdom X NaN NaN
9 1 Austria X NaN NaN
10 1 India X NaN NaN
11 1 Singapore 6 103 1
12 1 Japan 7 103 1
13 1 United Kingdom X 103 1
14 1 Austria X 103 1
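If a loop-free version is preferred, one possible alternative is to build every (User, master row) combination with a cross join and then left join the user table on both keys. This is only a sketch: it assumes pandas 1.2 or newer (for how='cross'), the names users and grid are my own, and unlike the concat version it fills in User on every row rather than leaving NaN where there is no match.

```python
import pandas as pd

master = pd.DataFrame({
    'Channel': ['1', '1', '1', '1', '1'],
    'Country': ['India', 'Singapore', 'Japan', 'United Kingdom', 'Austria'],
    'Product': ['X', '6', '7', 'X', 'X'],
})
user = pd.DataFrame({
    'User': ['101', '101', '102', '102', '102', '103', '103', '103', '103', '103'],
    'Country': ['India', 'Brazil', 'India', 'Brazil', 'Japan', 'All',
                'Austria', 'Japan', 'Singapore', 'United Kingdom'],
    'count': ['2', '1', '3', '2', '1', '1', '1', '1', '1', '1'],
})

# One row per distinct user
users = user[['User']].drop_duplicates()

# Every user paired with every master row (requires pandas >= 1.2)
grid = users.merge(master, how='cross')

# Attach the counts where the (User, Country) pair exists in the user table
merge_all = grid.merge(user, how='left', on=['User', 'Country'])
print(merge_all)
```

The rows with a non-missing count are the same seven matches as in the output above; only the User column differs on the unmatched rows.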
I have a DataFrame which I made into a pivot table, but now I want to order the pivot table so that common values based on a particular column are aligned beside each other. For example, order the DataFrame so that all common countries align to the same row:
import pandas as pd
data = {'dt': ['2016-08-22', '2016-08-21', '2016-08-22', '2016-08-21', '2016-08-21'],
        'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
        'number': [10, 21, 20, 10, 12]
        }
df = pd.DataFrame(data)
print(df)
country dt number
0 uk 2016-08-22 10
1 usa 2016-08-21 21
2 fr 2016-08-22 20
3 fr 2016-08-21 10
4 uk 2016-08-21 12
#pivot table by dt:
df['idx'] = df.groupby('dt')['dt'].cumcount()
df_pivot = df.set_index(['idx','dt']).stack().unstack([1,2])
print(df_pivot)
dt 2016-08-22 2016-08-21
country number country number
idx
0 uk 10 usa 21
1 fr 20 fr 10
2 NaN NaN uk 12
#what I really want:
dt 2016-08-22 2016-08-21
country number country number
0 uk 10 uk 12
1 fr 20 fr 10
2 NaN NaN usa 21
or even better:
2016-08-22 2016-08-21
country number number
0 uk 10 12
1 fr 20 10
2 usa NaN 21
i.e. the uk values from both 2016-08-22 and 2016-08-21 are aligned on the same row.
You can use:
df_pivot = df.set_index(['dt','country']).stack().unstack([0,2]).reset_index()
print (df_pivot)
dt country 2016-08-22 2016-08-21
number number
0 fr 20.0 10.0
1 uk 10.0 12.0
2 usa NaN 21.0
#move the 'country' label of the first column from the first level of the MultiIndex to the second
cols = [col for col in df_pivot.columns]
df_pivot.columns = pd.MultiIndex.from_tuples([('','country')] + cols[1:])
print (df_pivot)
2016-08-22 2016-08-21
country number number
0 fr 20.0 10.0
1 uk 10.0 12.0
2 usa NaN 21.0
Another, simpler solution is to use pivot:
df_pivot = df.pivot(index='country', columns='dt', values='number')
print (df_pivot)
dt 2016-08-21 2016-08-22
country
fr 10.0 20.0
uk 12.0 10.0
usa 21.0 NaN
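To get closer to the "even better" layout from the question (country as an ordinary column, newest date first), the pivot result can be post-processed. The following is a sketch under my own assumptions: the name wide and the choice of descending column order are not from the original answer.

```python
import pandas as pd

data = {'dt': ['2016-08-22', '2016-08-21', '2016-08-22', '2016-08-21', '2016-08-21'],
        'country': ['uk', 'usa', 'fr', 'fr', 'uk'],
        'number': [10, 21, 20, 10, 12]}
df = pd.DataFrame(data)

# One row per country, one column per date
wide = df.pivot(index='country', columns='dt', values='number')

# Put the most recent date first, then turn the country index into a column
wide = wide.sort_index(axis=1, ascending=False).reset_index()

# Drop the leftover 'dt' label on the column index
wide.columns.name = None
print(wide)
```

Note that pivot raises an error if any (country, dt) pair appears more than once; in that case pivot_table with an aggregation function would be needed.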