I have a DataFrame like below:
import pandas as pd

data = pd.DataFrame({"Country": ["Brazil", "Brazil", "Germany", "Germany", "UK"],
                     "Order method": ["Phone", "Retail", "Web", "Web", "Retail"]})
And I would like to create a new DataFrame from it that counts every Country/Order method combination, including the combinations that do not occur (with a count of 0).
Use GroupBy.size with Series.unstack and DataFrame.stack to add the missing categories:
s = data.groupby(['Country', 'Order method']).size().unstack(fill_value=0).stack()
print(s)
Country Order method
Brazil Phone 1
Retail 1
Web 0
Germany Phone 0
Retail 0
Web 2
UK Phone 0
Retail 1
Web 0
dtype: int64
For a DataFrame, add DataFrame.reset_index:
df = (data.groupby(['Country', 'Order method'])
          .size()
          .unstack(fill_value=0)
          .stack()
          .reset_index(name='Count'))
print(df)
Country Order method Count
0 Brazil Phone 1
1 Brazil Retail 1
2 Brazil Web 0
3 Germany Phone 0
4 Germany Retail 0
5 Germany Web 2
6 UK Phone 0
7 UK Retail 1
8 UK Web 0
Last, if necessary, replace the duplicated values with empty strings using Series.mask with Series.duplicated:
df['Country'] = df['Country'].mask(df['Country'].duplicated(), '')
print(df)
Country Order method Count
0 Brazil Phone 1
1 Retail 1
2 Web 0
3 Germany Phone 0
4 Retail 0
5 Web 2
6 UK Phone 0
7 Retail 1
8 Web 0
I have data on births that looks like this:
Date Country Sex
1.1.20 USA M
1.1.20 USA M
1.1.20 Italy F
1.1.20 England M
2.1.20 Italy F
2.1.20 Italy M
3.1.20 USA F
3.1.20 USA F
My goal is to get a new DataFrame in which each row is a date and country, together with the total number of births, the number of male births and the number of female births. It is supposed to look like this:
Date Country Births Males Females
1.1.20 USA 2 2 0
1.1.20 Italy 1 0 1
1.1.20 England 1 1 0
2.1.20 Italy 2 1 1
3.1.20 USA 2 0 2
I tried using this code:
df.groupby(by=['Date', 'Country', 'Sex']).size()
but it only gave me a new column of total births, with different rows for each sex in every date+country combination.
Any help will be appreciated.
Thanks,
Eran
You can group the DataFrame on the columns Date and Country, then aggregate the column Sex using value_counts followed by unstack to reshape; finally, assign the Births column by summing the frequencies along axis=1:
out = df.groupby(['Date', 'Country'], sort=False)['Sex']\
        .value_counts().unstack(fill_value=0)
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Or you can use a very similar approach with pd.crosstab instead of groupby + value_counts:
out = pd.crosstab([df['Date'], df['Country']], df['Sex'], colnames=[None])
out.assign(Births=out.sum(axis=1)).reset_index()\
   .rename(columns={'M': 'Male', 'F': 'Female'})
Date Country Female Male Births
0 1.1.20 USA 0 2 2
1 1.1.20 Italy 1 0 1
2 1.1.20 England 0 1 1
3 2.1.20 Italy 1 1 2
4 3.1.20 USA 2 0 2
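If the exact Births, Males and Females column names from the question are wanted in one step, here is a minimal sketch using named aggregation (assuming pandas >= 0.25; the helper columns is_m and is_f are made up for illustration):

out = (df.assign(is_m=df['Sex'].eq('M'), is_f=df['Sex'].eq('F'))
         .groupby(['Date', 'Country'], sort=False)
         .agg(Births=('Sex', 'size'),  # total rows per group
              Males=('is_m', 'sum'),   # summing booleans counts True values
              Females=('is_f', 'sum'))
         .reset_index())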
I have this DataFrame, and I want the count of all non-zero INTERATION values per date and location, and also per email:
DATE LOC EMAIL INTERATION
1/11 INDIA qw#mail.com 0
1/11 INDIA ap#mail.com 11
1/11 LONDON az#mail.com 2
2/11 INDIA qw#mail.com 5
2/11 INDIA rw#mail.com 5
2/11 LONDON az#mail.com 0
3/11 LONDON az#mail.com 1
So my resulting dataframe should look like this:
DATE LOC INTERATION
1/11 INDIA 1
1/11 LONDON 1
2/11 INDIA 2
2/11 LONDON 0
3/11 LONDON 1
Thanks in advance
Use groupby with agg and numpy.count_nonzero:
import numpy as np

df1 = df.groupby(['DATE', 'LOC'], as_index=False)['INTERATION'].agg(np.count_nonzero)
print(df1)
DATE LOC INTERATION
0 1/11 INDIA 1
1 1/11 LONDON 1
2 2/11 INDIA 2
3 2/11 LONDON 0
4 3/11 LONDON 1
Another solution is to create a boolean mask by comparing with ne (not equal), cast it to integers and aggregate sum:
df1 = (df.assign(INTERATION=df['INTERATION'].ne(0).astype(int))
         .groupby(['DATE', 'LOC'], as_index=False)['INTERATION']
         .sum())
If you need to group by the column EMAIL too:
df2 = df.groupby(['DATE', 'LOC', 'EMAIL'], as_index=False)['INTERATION'].agg(np.count_nonzero)
print(df2)
DATE LOC EMAIL INTERATION
0 1/11 INDIA ap#mail.com 1
1 1/11 INDIA qw#mail.com 0
2 1/11 LONDON az#mail.com 1
3 2/11 INDIA qw#mail.com 1
4 2/11 INDIA rw#mail.com 1
5 2/11 LONDON az#mail.com 0
6 3/11 LONDON az#mail.com 1
One not necessarily efficient solution is to convert to bool and then sum. This uses the fact that 0 / 1 are equivalent to False / True respectively in calculations:
res = df.groupby(['DATE', 'LOC'])['INTERATION']\
.apply(lambda x: x.astype(bool).sum()).reset_index()
print(res)
DATE LOC INTERATION
0 1/11 INDIA 1
1 1/11 LONDON 1
2 2/11 INDIA 2
3 2/11 LONDON 0
4 3/11 LONDON 1
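A vectorized variant of the same idea, sketched below, groups the boolean mask directly and avoids apply:

# Summing a boolean Series per group counts its True (non-zero) values
res = (df['INTERATION'].ne(0)
         .groupby([df['DATE'], df['LOC']])
         .sum()
         .reset_index(name='INTERATION'))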
I currently have a column called Country that can have a value of USA, Canada, or Japan. For example:
Country
-------
Japan
Japan
USA
....
Canada
I want to split ("extract") the values into three individual columns (Country_USA, Country_Canada, and Country_Japan), and basically, a column will have a value of 1 if it matches the original value from the Country column. For example:
Country --> Country_Japan Country_USA Country_Canada
------- ------------- ----------- ---------------
Japan 1 0 0
USA 0 1 0
Japan 1 0 0
....
Is there a simple (non-tedious) way to do this using pandas / Python 3.x? Thanks!
Use join with get_dummies and add_prefix:
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Demo:
df=pd.DataFrame({'Country':['Japan','USA','Japan','Canada']})
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
Better version, thanks to Scott:
print(df.join(pd.get_dummies(df)))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
Another good version from Scott:
print(df.assign(**pd.get_dummies(df)))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
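One version caveat: on pandas 2.0+, get_dummies returns boolean True/False columns by default, so pass dtype=int if the 0/1 integers shown above are wanted:

# dtype=int restores the 0/1 output on pandas 2.0+
print(df.join(pd.get_dummies(df, dtype=int)))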
I have some customer data such as this in a data frame:
S No Country Sex
1 Spain M
2 Norway F
3 Mexico M
...
I want to have an output such as this:
Spain
M = 1207
F = 230
Norway
M = 33
F = 102
...
I have a basic notion that I want to group my rows based on their countries with something like df.groupby(df.Country), and on the selected rows, I need to run something like df.Sex.value_counts()
Thanks!
I think you need crosstab:
df = pd.crosstab(df.Sex, df.Country)
Or, if you want to use your solution, add unstack to move the first level of the MultiIndex to columns:
df = df.groupby(df.Country).Sex.value_counts().unstack(level=0, fill_value=0)
print(df)
Country Mexico Norway Spain
Sex
F 0 1 0
M 1 0 1
EDIT:
If you want to add more columns, it is possible to set which level of the index is converted to columns via the level parameter:
df1 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=0, fill_value=0).reset_index()
print(df1)
No Country Sex 1 2 3
0 Mexico M 0 0 1
1 Norway F 0 1 0
2 Spain M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=1, fill_value=0).reset_index()
print(df2)
Country No Sex Mexico Norway Spain
0 1 M 0 0 1
1 2 F 0 1 0
2 3 M 1 0 0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=2, fill_value=0).reset_index()
print(df2)
Sex No Country F M
0 1 Spain 0 1
1 2 Norway 1 0
2 3 Mexico 0 1
You can also use pandas.pivot_table:
res = df.pivot_table(index='Country', columns='Sex', aggfunc='count', fill_value=0)
print(res)
SNo
Sex F M
Country
Mexico 0 1
Norway 1 0
Spain 0 1
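If the literal printed layout from the question is needed rather than a table, a small sketch can iterate the groups directly:

# Print each country followed by its per-sex counts in "M = n" style
for country, sexes in df.groupby('Country')['Sex']:
    print(country)
    for sex, n in sexes.value_counts().items():
        print(f'{sex} = {n}')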
I have keywords:
India
Japan
United States
Germany
China
Here's a sample DataFrame:
id Address
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2 Arcisstraße 21, 80333 München, Germany
3 Liberty Street, Manhattan, New York, United States
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5 Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
My goal is to make:
id Address India Japan United States Germany China
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan 0 1 0 0 0
2 Arcisstraße 21, 80333 München, Germany 0 0 0 1 0
3 Liberty Street, Manhattan, New York, USA 0 0 1 0 0
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 0 0 0 0 1
5 Vaishnavi Summit,80feet Road,Bangalore, Karnataka, India 1 0 0 0 0
The basic idea is to create a keyword detector. I am thinking of using str.contains and word2vec, but I can't get the logic right.
Make use of pd.get_dummies():
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand = False)
dummies = pd.get_dummies(countries)
pd.concat([df,dummies],axis = 1)
Also, the most straightforward way is to have the countries in a list and use a for loop, say
countries = ['India','Japan','United States','Germany','China']
for c in countries:
df[c] = df.Address.str.contains(c) * 1
but it can be slow if you have a lot of data and countries.
In [58]: df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.get_dummies())
In [59]: df
Out[59]:
id Address China Germany India Japan United States
0 1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J... 0 0 0 1 0
1 2 Arcisstraße 21, 80333 München, Germany 0 1 0 0 0
2 3 Liberty Street, Manhattan, New York, United St... 0 0 0 0 1
3 4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China 1 0 0 0 0
4 5 Vaishnavi Summit,80feet Road,3rd Block,Bangalo... 0 0 1 0 0
NOTE: this method will not work if the country is not at the last position in the Address column, or if the country name contains a comma.
from numpy.core.defchararray import find

kw = 'India|Japan|United States|Germany|China'.split('|')
# Elementwise substring search of every keyword against every address,
# broadcast over the (n, 1) array of addresses and the list of keywords
a = df.Address.values.astype(str)[:, None]
df.join(
    pd.DataFrame(
        find(a, kw) >= 0,   # True where the keyword occurs in the address
        df.index, kw,
        dtype=int
    )
)
id Address India Japan United States Germany China
0 1 Chome-2-8 Shibakoen, Minat... 0 1 0 0 0
1 2 Arcisstraße 21, 80333 Münc... 0 0 0 1 0
2 3 Liberty Street, Manhattan,... 0 0 1 0 0
3 4 30 Shuangqing Rd, Haidian ... 0 0 0 0 1
4 5 Vaishnavi Summit,80feet Ro... 1 0 0 0 0
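A middle ground between the loop and the numpy version, sketched here, builds all indicator columns in one pass with str.contains and a single concat (regex=False treats each keyword as a literal string):

kw = ['India', 'Japan', 'United States', 'Germany', 'China']
# One 0/1 column per keyword; the dict keys become the column names
flags = pd.concat(
    {c: df['Address'].str.contains(c, regex=False).astype(int) for c in kw},
    axis=1)
df = df.join(flags)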