I don't know if this is possible but I have a data frame like this one:
df
State County Homicides Man Woman Not_Register
Gto Celaya 2 2 0 0
NaN NaN 8 4 2 2
NaN NaN 3 2 1 0
NaN Yiriria 2 1 1 0
NaN Acambaro 1 1 0 0
Sin Culiacan 3 1 1 1
NaN NaN 5 4 0 1
Chih Juarez 1 1 0 0
I want to group by State and County and sum Homicides, Man, Woman, and Not_Register, like this:
State County Homicides Man Woman Not_Register
Gto Celaya 13 8 3 2
Gto Yiriria 2 1 1 0
Gto Acambaro 1 1 0 0
Sin Culiacan 8 5 1 2
Chih Juarez 1 1 0 0
So far, I have been able to group by State and County after filling the NaN rows with the right State and County names. My code and result:
import pandas as pd
import numpy as np

df = df.fillna(method='pad')  # Repeat the State and County names in the right order
# Group and sum
df = df.groupby(["State", "County"]).agg('sum')
df = df.reset_index()
df
State County Homicides
Gto Celaya 13
Gto Yiriria 2
Gto Acambaro 1
Sin Culiacan 8
Chih Juarez 1
But when I tried to add Man and Woman:
df1 = df.groupby(["State", "County", "Man", "Woman", "Not_Register"]).agg('sum')
df1 = df1.reset_index()
df1
My result repeats the counties instead of giving me a unique County per State. How can I resolve this issue?
Thanks for your help.
Change to the following. The numeric columns are being read as strings, so convert them with pd.to_numeric, and group only by State and County; the other columns should be summed, not used as grouping keys:
df[['Homicides', 'Man', 'Woman', 'Not_Register']] = df[['Homicides', 'Man', 'Woman', 'Not_Register']].apply(pd.to_numeric, errors='coerce')
df = df.groupby(['State', 'County']).sum().reset_index()
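Putting the whole pipeline together, here is a minimal sketch (assuming the sample frame above, with pandas imported as pd):

import pandas as pd

# Forward-fill the State and County names down the blank rows
df[['State', 'County']] = df[['State', 'County']].ffill()

# Coerce the count columns to numbers, then sum per (State, County) pair
cols = ['Homicides', 'Man', 'Woman', 'Not_Register']
df[cols] = df[cols].apply(pd.to_numeric, errors='coerce')
df = df.groupby(['State', 'County'], sort=False).sum().reset_index()
print(df)

Here sort=False keeps the original State order (Gto, Sin, Chih) instead of sorting the group keys alphabetically.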
I am looking to calculate the change in mental health scores of each individual between two timepoints.
Each user has a name and a mental health score from 3 different timepoints. I would like to calculate the change in mental health score between timepoints 3 and 1.
Below is an example of the df I'm starting with:
User Timepoint Mental Health Score
Bill 1 5
Bill 2 10
Bill 3 15
Wiz 1 10
Wiz 2 10
Wiz 3 15
Sam 1 5
Sam 2 5
Sam 3 5
This is desired output:
User Timepoint Mental Health Score Change in Mental Health (TP1 and 3)
Bill 1 5
Bill 2 10
Bill 3 15 10
Wiz 1 10
Wiz 2 10
Wiz 3 15 5
Sam 1 5
Sam 2 5
Sam 3 5 0
Does anyone know how to do this?
You can accomplish this using shift() and np.where() (this assumes each user has exactly three rows, ordered by Timepoint):
import numpy as np

df['Change in Mental Health (TP1 and 3)'] = df['Mental Health Score'] - df['Mental Health Score'].shift(2)
df['Change in Mental Health (TP1 and 3)'] = np.where(df['Timepoint'] != 3, 0, df['Change in Mental Health (TP1 and 3)']).astype(int)
df
Try with groupby and where:
#sort by Timepoint if needed
#df = df.sort_values("Timepoint")
changes = df.groupby("User")["Mental Health Score"].transform('last')-df.groupby("User")["Mental Health Score"].transform('first')
df["Change"] = changes.where(df["Timepoint"].eq(3))
>>> df
User Timepoint Mental Health Score Change
0 Bill 1 5 NaN
1 Bill 2 10 NaN
2 Bill 3 15 10.0
3 Wiz 1 10 NaN
4 Wiz 2 10 NaN
5 Wiz 3 15 5.0
6 Sam 1 5 NaN
7 Sam 2 5 NaN
8 Sam 3 5 0.0
As already stated in the comments, you can group your dataframe by User and calculate the difference in Mental Health Score.
Here is a code snippet to demonstrate:
def _overall_change(scores):
    return scores.iloc[-1] - scores.iloc[0]

person = df.groupby('User')['Mental Health Score'].agg(_overall_change)
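To place those per-user changes back on the Timepoint-3 rows only, a minimal sketch building on the person series above (assuming the rows are sorted by Timepoint within each user):

# Map each user's overall change onto its rows, keeping it only where Timepoint == 3
df['Change'] = df['User'].map(person).where(df['Timepoint'].eq(3))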
Using groupby and a merge:
g = df.sort_values(by='Timepoint').groupby('User')['Mental Health Score']
s = pd.concat({3: g.last()-g.first()})
# User
# 3 Bill 10
# Sam 0
# Wiz 5
# Name: Mental Health Score, dtype: int64
df.merge(s, left_on=['Timepoint', 'User'], right_index=True, how='left')
Output:
User Timepoint Mental Health Score_x Mental Health Score_y
0 Bill 1 5 NaN
1 Bill 2 10 NaN
2 Bill 3 15 10.0
3 Wiz 1 10 NaN
4 Wiz 2 10 NaN
5 Wiz 3 15 5.0
6 Sam 1 5 NaN
7 Sam 2 5 NaN
8 Sam 3 5 0.0
Here's another possible solution:
import pandas as pd
def calculate_change(mhs):
    mhs = list(mhs)
    return mhs[-1] - mhs[0]
df = df.sort_values(["User", "Timepoint"])
diff = df.groupby('User')['Mental Health Score'].agg(calculate_change)
df = pd.merge(df, diff, how='left', left_on='User', right_index=True)
df.columns = ['User', 'Timepoint', 'Mental Health Score', 'Change']
df['Change'] = df['Change'].loc[df['Timepoint']==3]
print(df)
Output
User Timepoint Mental Health Score Change
0 Bill 1 5 NaN
1 Bill 2 10 NaN
2 Bill 3 15 10.0
3 Wiz 1 10 NaN
4 Wiz 2 10 NaN
5 Wiz 3 15 5.0
6 Sam 1 5 NaN
7 Sam 2 5 NaN
8 Sam 3 5 0.0
data = {'col1': ['Country', 'State', 'City', 'park', 'avenue'],
        'col2': ['County', 'stats', 'PARK', 'Avenue', 'cities']}
df = pd.DataFrame(data)
col1 col2
0 Country County
1 State stats
2 City PARK
3 park Avenue
4 avenue cities
I am trying to match the names in the two columns with the fuzzywuzzy technique and order them by score.
Desired output:
col1 col2 score order
0 Country County 92 1
1 Country stats 31 2
2 Country PARK 18 3
3 Country Avenue 17 4
4 Country cities 16 5
5 State County 80 1
6 State stats 36 2
7 State PARK 22 3
8 State Avenue 18 4
9 State cities 16 5
.....
What I did:

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

for i in df.col1:
    for j in df.col2:
        print(i, j, fuzz.token_set_ratio(i, j))
I got stuck here.
Let us do
df['score'] = df.apply(lambda x: fuzz.ratio(x['col1'], x['col2']), axis=1)
df['score']
0 92
1 60
2 0
3 0
4 17
dtype: int64
Then
df['order'] = (-df['score']).groupby(df['col1']).rank(method='first')
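Note that the above scores only the pairs that happen to share a row. If every col1 × col2 combination should be scored and ranked within each col1 value, as in the desired output, here is one possible sketch (a cross merge, which needs pandas >= 1.2):

from fuzzywuzzy import fuzz

# Cross join col1 with col2, score every pair, then rank per col1 value
pairs = df[['col1']].merge(df[['col2']], how='cross')
pairs['score'] = pairs.apply(lambda x: fuzz.token_set_ratio(x['col1'], x['col2']), axis=1)
pairs['order'] = pairs.groupby('col1')['score'].rank(method='first', ascending=False).astype(int)
pairs = pairs.sort_values(['col1', 'order'])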
If I have the following dataframe:
df = pd.DataFrame({'name': ['john', 'mary', 'peter', 'jeff', 'bill', 'lisa', 'jose'],
                   'gender': ['M', 'F', 'M', 'M', 'M', 'F', 'M'],
                   'state': ['california', 'dc', 'california', 'dc', 'california', 'texas', 'texas'],
                   'num_children': [2, 0, 0, 3, 2, 1, 4],
                   'num_pets': [5, 1, 0, 5, 2, 2, 3]})
name gender state num_children num_pets
0 john M california 2 5
1 mary F dc 0 1
2 peter M california 0 0
3 jeff M dc 3 5
4 bill M california 2 2
5 lisa F texas 1 2
6 jose M texas 4 3
I want to create a new row and a new column, both named pct., to get the percentage of zero values in the columns num_children and num_pets.
Expected output:
name gender state num_children num_pets pct.
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have calculated the percentage of zeros in each row for the target columns:
df['pct'] = df[['num_children', 'num_pets']].astype(bool).sum(axis=1)/2
df['pct.'] = 1-df['pct']
del df['pct']
df['pct.'] = pd.Series(["{0:.0f}%".format(val * 100) for val in df['pct.']], index=df.index)
name gender state num_children num_pets pct.
0 john M california 2 5 0%
1 mary F dc 0 1 50%
2 peter M california 0 0 100%
3 jeff M dc 3 5 0%
4 bill M california 2 2 0%
5 lisa F texas 1 2 0%
6 jose M texas 4 3 0%
But I don't know how to insert the results below into the pct. row of the expected output. Please help me get the expected result in a more pythonic way. Thanks.
df[['num_children', 'num_pets']].astype(bool).sum(axis=0)/len(df.num_children)
Out[153]:
num_children 0.714286
num_pets 0.857143
dtype: float64
UPDATE: the same thing, but for calculating sums; great thanks to @jezrael:
df['sums'] = df[['num_children', 'num_pets']].sum(axis=1)
df1 = (df[['num_children', 'num_pets']].sum()
.to_frame()
.T
.assign(name='sums'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
ignore_index=True, sort=False)
print(df)
name gender state num_children num_pets sums
0 sums 12 18
1 john M california 2 5 7
2 mary F dc 0 1 1
3 peter M california 0 0 0
4 jeff M dc 3 5 8
5 bill M california 2 2 4
6 lisa F texas 1 2 3
7 jose M texas 4 3 7
You can use mean with a boolean mask, comparing values to 0 with DataFrame.eq (since sum/len = mean by definition), then multiply by 100 and add the percent sign with apply:
s = df[['num_children', 'num_pets']].eq(0).mean(axis=1)
df['pct'] = s.mul(100).apply("{0:.0f}%".format)
For the first row, create a new DataFrame with the same columns as the original and concat them together:
df1 = (df[['num_children', 'num_pets']].eq(0)
.mean()
.mul(100)
.apply("{0:.1f}%".format)
.to_frame()
.T
.assign(name='pct.'))
df = pd.concat([df1.reindex(columns=df.columns, fill_value=''), df],
ignore_index=True, sort=False)
print(df)
name gender state num_children num_pets pct
0 pct. 28.6% 14.3%
1 john M california 2 5 0%
2 mary F dc 0 1 50%
3 peter M california 0 0 100%
4 jeff M dc 3 5 0%
5 bill M california 2 2 0%
6 lisa F texas 1 2 0%
7 jose M texas 4 3 0%
I have keywords:
India
Japan
United States
Germany
China
Here's a sample dataframe:
id Address
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2 Arcisstraße 21, 80333 München, Germany
3 Liberty Street, Manhattan, New York, United States
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5 Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
My goal is to make:
id Address India Japan United States Germany China
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan 0 1 0 0 0
2 Arcisstraße 21, 80333 München, Germany 0 0 0 1 0
3 Liberty Street, Manhattan, New York, USA 0 0 1 0 0
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 0 0 0 0 1
5 Vaishnavi Summit,80feet Road,Bangalore, Karnataka, India 1 0 0 0 0
The basic idea is to create a keyword detector. I am thinking of using str.contains and word2vec, but I can't get the logic right.
Make use of pd.get_dummies():
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand=False)
dummies = pd.get_dummies(countries)
df = pd.concat([df, dummies], axis=1)
Also, the most straightforward way is to have the countries in a list and use a for loop, say
countries = ['India','Japan','United States','Germany','China']
for c in countries:
    df[c] = df.Address.str.contains(c) * 1
but it can be slow if you have a lot of data and countries.
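One caveat with the loop: str.contains treats its pattern as a regular expression by default, so keywords containing regex metacharacters would misbehave. A safer sketch escapes them first:

import re

for c in countries:
    # Escape the keyword so it is matched literally, then cast the boolean to 0/1
    df[c] = df.Address.str.contains(re.escape(c)).astype(int)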
In [58]: df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.get_dummies())
In [59]: df
Out[59]:
id Address China Germany India Japan United States
0 1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J... 0 0 0 1 0
1 2 Arcisstraße 21, 80333 München, Germany 0 1 0 0 0
2 3 Liberty Street, Manhattan, New York, United St... 0 0 0 0 1
3 4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 1 0 0 0 0
4 5 Vaishnavi Summit,80feet Road,3rd Block,Bangalo... 0 0 1 0 0
NOTE: this method will not work if the country is not in the last position in the Address column, or if the country name contains a comma.
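Also, as written the capture group keeps the space after the comma, so the dummy columns may come out as ' Japan', ' Germany', and so on; stripping the whitespace first avoids that (same assumption that the country comes last):

df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.strip().str.get_dummies())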
import pandas as pd
from numpy.core.defchararray import find

kw = 'India|Japan|United States|Germany|China'.split('|')

# Column vector of addresses, broadcast against the keyword list
a = df.Address.values.astype(str)[:, None]

# find() returns the position of each keyword in each address (-1 if absent),
# so `>= 0` yields a boolean presence matrix
df.join(
    pd.DataFrame(
        find(a, kw) >= 0,
        df.index, kw,
        dtype=int
    )
)
id Address India Japan United States Germany China
0 1 Chome-2-8 Shibakoen, Minat... 0 1 0 0 0
1 2 Arcisstraße 21, 80333 Münc... 0 0 0 1 0
2 3 Liberty Street, Manhattan,... 0 0 1 0 0
3 4 30 Shuangqing Rd, Haidian ... 0 0 0 0 1
4 5 Vaishnavi Summit,80feet Ro... 1 0 0 0 0
I have a DataFrame (df) with various columns. In this assignment I have to find the difference between summer gold medals and winter gold medals, relative to total medals, for each country, using stats about the Olympics.
I must only include countries which have at least one gold medal. I am trying to use dropna() to exclude countries that do not have at least one gold medal. My current code:
def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_count'].dropna()
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    df['medal_dif'].dropna()
    return df.head()
print (answer_three())
This results in the following output:
# Summer Gold Silver Bronze Total # Winter Gold.1 \
Afghanistan 13 0 0 2 2 0 0
Algeria 12 5 2 8 15 3 0
Argentina 23 18 24 28 70 18 0
Armenia 5 1 2 9 12 6 0
Australasia 2 3 4 5 12 0 0
Silver.1 Bronze.1 Total.1 # Games Gold.2 Silver.2 Bronze.2 \
Afghanistan 0 0 0 13 0 0 2
Algeria 0 0 0 15 5 2 8
Argentina 0 0 0 41 18 24 28
Armenia 0 0 0 11 1 2 9
Australasia 0 0 0 2 3 4 5
Combined total ID medal_count medal_dif
Afghanistan 2 AFG 0 NaN
Algeria 15 ALG 5 1.0
Argentina 70 ARG 18 1.0
Armenia 12 ARM 1 1.0
Australasia 12 ANZ 3 1.0
I need to get rid of both the 0 values in medal_count and the NaN in medal_dif.
I am also aware that the maths/way I have written the code is probably not the right way to solve the question, but I think I need to start by dropping these values? Any help with any of the above is greatly appreciated.
You can pass an axis, e.g. axis=1, into dropna to control what gets dropped: axis=0 drops rows containing NaN and axis=1 drops entire columns, with 0 being the default.
Also note that dropna() returns a new object rather than modifying the frame in place, so the two bare dropna() calls in your code have no effect; assign the result back (or pass inplace=True).
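A minimal sketch of how answer_three might look with that in mind (keeping the question's column names; the exact filtering the assignment expects may differ):

def answer_three():
    df['medal_count'] = df['Gold'] - df['Gold.1']
    df['medal_dif'] = df['medal_count'] / df['Gold.2']
    # dropna() returns a new frame, so capture the result (rows are dropped by default)
    out = df.dropna(subset=['medal_dif'])
    # Keep only countries with at least one gold medal
    out = out[(out['Gold'] > 0) | (out['Gold.1'] > 0)]
    return out.head()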