I have a dataframe containing about 300 000 rows with a structure like this:
name Jack
gender M
year 1993
country USA
city Odessa
name John
gender M
year 1992
name Sam
country Canada
city Toronto
Is there a possibility to make dataframe looks like this using Pandas?
name gender year country city
Jack M 1993 USA Odessa
John M 1992
Sam Canada Toronto
Row with "name" is always there, but others could be absent. I try to use iterrows with no success.
In [17]:
g = np.cumsum(df.iloc[: , 0] == 'name')
In [15]:
df.groupby(g).apply(lambda x : pd.DataFrame(x.set_index([0]).T , columns=['name' , 'gender' , 'year' , 'country' , 'city']) )
Out[15]:
name gender year country city
0
1 1 Jack M 1993 USA Odessa
2 1 John M 1992 NaN NaN
3 1 Sam NaN NaN Canada Toronto
Related
I have the following data frame:
Name Age City Gender Country
0 Jane 23 NaN F London
1 Melissa 45 Nan F France
2 John 35 Nan M Toronto
I want to switch value between column based on condition:
if Country equal to Toronto and London
I would like to have this output:
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
How can I do this?
I would use .loc to check the rows where Country contains London or Toronto, then set the City column to those values and use another loc statement to replace London and Toronto with Nan in the country column
df.loc[df['Country'].isin(['London', 'Toronto']), 'City'] = df['Country']
df.loc[df['Country'].isin(['London', 'Toronto']), 'Country'] = np.nan
output:
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
You could use np.where:
cities = ['London', 'Toronto']
df['City'] = np.where(
df['Country'].isin(cities),
df['Country'],
df['City']
)
df['Country'] = np.where(
df['Country'].isin(cities),
np.nan,
df['Country']
)
Results:
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
cond = df['Country'].isin(['London', 'Toronto'])
df['City'].mask(cond, df['Country'], inplace = True)
df['Country'].mask(cond, np.nan, inplace = True)
Name Age City Gender Country
0 Jane 23 London F NaN
1 Melissa 45 NaN F France
2 John 35 Toronto M NaN
I have data on births that looks like this:
Date Country Sex
1.1.20 USA M
1.1.20 USA M
1.1.20 Italy F
1.1.20 England M
2.1.20 Italy F
2.1.20 Italy M
3.1.20 USA F
3.1.20 USA F
My purpose is to get a new dataframe in which each row is a date at a country, and then number of total births, number of male births and number of female births. It's supposed to look like this:
Date Country Births Males Females
1.1.20 USA 2 2 0
1.1.20 Italy 1 0 1
1.1.20 England 1 1 0
2.1.20 Italy 2 1 1
3.1.20 USA 2 0 2
I tried using this code:
df.groupby(by=['Date', 'Country', 'Sex']).size()
but it only gave me a new column of total births, with different rows for each sex in every date+country combination.
any help will be appreciated.
Thanks,
Eran
You can group the dataframe on columns Date and Country then aggregate column Sex using value_counts followed by unstack to reshape, finally assign the Births columns by summing frequency along axis=1:
out = df.groupby(['Date', 'Country'], sort=False)['Sex']\
.value_counts().unstack(fill_value=0)
out.assign(Births=out.sum(1)).reset_index()\
.rename(columns={'M': 'Male', 'F': 'Female'})
Or you can use a very similar approach with .crosstab instead of groupby + value_counts:
out = pd.crosstab([df['Date'], df['Country']], df['Sex'], colnames=[None])
out.assign(Births=out.sum(1)).reset_index()\
.rename(columns={'M': 'Male', 'F': 'Female'})
Date Country Female Male Births
0 1.1.20 USA 0 2 2
1 1.1.20 Italy 1 0 1
2 1.1.20 England 0 1 1
3 2.1.20 Italy 1 1 2
4 3.1.20 USA 2 0 2
I have a datframe df like the following:
df name city
0 John New York
1 Carl New York
2 Carl Paris
3 Eva Paris
4 Eva Paris
5 Carl Paris
I want to know the total number of people in the different cities
df2 city number
0 New York 2
1 Paris 3
or the number of people with the same name in the cities
df2 name city number
0 John New York 1
1 Eva Paris 2
2 Carl Paris 2
3 Eva New York 0
I believe need GroupBy.size:
df1 = df.groupby(['city']).size().reset_index(name='number')
print (df1)
city number
0 New York 2
1 Paris 4
df2 = df.groupby(['name','city']).size().reset_index(name='number')
print (df2)
name city number
0 Carl New York 1
1 Carl Paris 2
2 Eva Paris 2
3 John New York 1
If need all combinations one solution is add unstack and stack:
df3=df.groupby(['name','city']).size().unstack(fill_value=0).stack().reset_index(name='count')
print (df3)
name city number
0 Carl New York 1
1 Carl Paris 2
2 Eva New York 0
3 Eva Paris 2
4 John New York 1
5 John Paris 0
Or reindex with MultiIndex.from_product:
df2 = df.groupby(['name','city']).size()
mux = pd.MultiIndex.from_product(df2.index.levels, names=df2.index.names)
df2 = df2.reindex(mux, fill_value=0).reset_index(name='number')
print (df2)
name city number
0 Carl New York 1
1 Carl Paris 2
2 Eva New York 0
3 Eva Paris 2
4 John New York 1
5 John Paris 0
To count the number of people with different names in the same city:
groups = df.groupby('city').count().reset_index()
To count the number of people with the same name in different cities:
groups = df.groupby('city').count().reset_index()
I have a data frame like:
Company Country
ABC USA
ABC USA
BCD USA
BCD USA
ABC USA
The output should be : -
Company Country
ABC USA
BCD USA
I think you need drop_duplicates if need unique values in all columns:
df = df.drop_duplicates()
print (df)
Company Country
0 ABC USA
2 BCD USA
Or if need specify column(s) for check duplicates add parameter subset:
df = df.drop_duplicates(subset=['Company'])
print (df)
Company Country
0 ABC USA
2 BCD USA
And solution with groupby and aggregate first:
df = df.groupby('Company', as_index=False).first()
print (df)
Company Country
0 ABC USA
1 BCD USA
Just for the sake of completeness, you can also use:
df.groupby('Company').head(1)
Out:
Company Country
0 ABC USA
2 BCD USA
I have a DataFrame like this:
name sex births year
0 Mary F 7433 2000
1 John M 6542 2000
2 Emma F 2342 2000
3 Ron M 5432 2001
4 Bessie F 4234 2001
5 Jennie F 2413 2002
6 Nick M 2343 2002
7 Ron M 4342 2002
I need to get new DataFrame where data will be grouped by year and sex, and last two columns will be name with max births and max (births) value, like this:
year sex name births
0 2000 F Mary 7433
1 2000 M John 6542
2 2001 F Bessie 4234
3 2001 M Ron 5432
4 2002 F Jennie 2413
5 2002 M Ron 4342
It can be done using the following groupby operation:
>>> df.groupby(['year', 'sex'], as_index=False).max()
year sex name births
0 2000 F Mary 7433
1 2000 M John 6542
2 2001 F Bessie 4234
3 2001 M Ron 5432
4 2002 F Jennie 2413
5 2002 M Ron 4342
as_index=False stops the groupby keys from becoming the index in the returned DataFrame.
Alternatively, to get the desired output you may need to to sort the 'births' column and then use groupby.first():
df = df.sort_values(by='births', ascending=False)
df.groupby(['year', 'sex'], as_index=False).first()