Split ("extract") columns using pandas [duplicate] - python

This question already has answers here:
Python Pandas: create a new column for each different value of a source column (with boolean output as column values)
(4 answers)
Closed 4 years ago.
I currently have a column called Country that can have a value of USA, Canada, Japan. For example:
Country
-------
Japan
Japan
USA
....
Canada
I want to split ("extract") the values into three individual columns (Country_USA, Country_Canada, and Country_Japan), and basically, a column will have a value of 1 if it matches the original value from the Country column. For example:
Country   -->   Country_Japan  Country_USA  Country_Canada
-------         -------------  -----------  --------------
Japan           1              0            0
USA             0              1            0
Japan           1              0            0
....
Is there a simple (non-tedious) way to do this using pandas / Python 3.x? Thanks!

Use join with get_dummies and add_prefix:
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Demo:
df=pd.DataFrame({'Country':['Japan','USA','Japan','Canada']})
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Output:
  Country  Country_Canada  Country_Japan  Country_USA
0   Japan               0              1            0
1     USA               0              0            1
2   Japan               0              1            0
3  Canada               1              0            0
Better version, thanks to Scott:
print(df.join(pd.get_dummies(df)))
Output:
  Country  Country_Canada  Country_Japan  Country_USA
0   Japan               0              1            0
1     USA               0              0            1
2   Japan               0              1            0
3  Canada               1              0            0
Another good version from Scott:
print(df.assign(**pd.get_dummies(df)))
Output:
  Country  Country_Canada  Country_Japan  Country_USA
0   Japan               0              1            0
1     USA               0              0            1
2   Japan               0              1            0
3  Canada               1              0            0
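As a side note, get_dummies can also attach the prefix itself via its prefix argument, which makes the add_prefix step optional. A minimal sketch, assuming a reasonably recent pandas:

```python
import pandas as pd

df = pd.DataFrame({'Country': ['Japan', 'USA', 'Japan', 'Canada']})

# get_dummies can add the "Country_" prefix itself, so add_prefix is optional
dummies = pd.get_dummies(df['Country'], prefix='Country')
out = df.join(dummies)
print(out)
```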

Related

Convert text to binary columns

I have a column in my dataframe that contains many different companies separated by commas (assume there are additional rows with even more companies).
company
apple,microsoft,disney,nike
microsoft,adidas,amazon,eBay
I want to convert this to binary columns for every possible company that appears. It should ultimately look like this:
   adidas  apple  amazon  eBay  disney  microsoft  nike  ...  last_store
0       0      1       0     0       1          1     1  ...           0
1       1      0       1     1       0          1     0  ...           0
Let us try get_dummies:
s = df.company.str.get_dummies(',')
   adidas  amazon  apple  disney  eBay  microsoft  nike
0       0       0      1       1     0          1     1
1       1       1      0       0     1          1     0
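To attach those binary columns back to the original frame, join works the same way here. A runnable sketch, assuming the column is named company as in the sample:

```python
import pandas as pd

df = pd.DataFrame({'company': ['apple,microsoft,disney,nike',
                               'microsoft,adidas,amazon,eBay']})

# one-hot encode the comma-separated values, then join back onto df
s = df['company'].str.get_dummies(',')
out = df.join(s)
print(out)
```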

Count specific strings in the columns of a dataframe [duplicate] - python

This question already has an answer here:
Group Value Count By Column with Pandas Dataframe
(1 answer)
Closed 3 years ago.
I have a (101×1766) dataframe; a sample is shown below.
Index  Id  Brand1   Brand2  Brand3
0      1   NaN      Good    Bad
1      2   Bad      NaN     NaN
2      3   NaN      NaN     VeryBad
3      4   Good     NaN     NaN
4      5   NaN      Good    VeryGood
5      6   VeryBad  Good    NaN
What I want to achieve is a table like this:
Index   VeryBad  Bad  Good  VeryGood
Brand1  1        1    0     0
Brand2  0        0    3     0
Brand3  1        1    0     1
I could not find a solution, not even a wrong one.
So, hope to see your help.
Let us do it in two steps: melt + crosstab
s=df.melt(['Id','Index'])
yourdf=pd.crosstab(s.variable,s.value)
yourdf
value     Bad  Good  VeryBad  VeryGood
variable
Brand1      1     1        1         0
Brand2      0     3        0         0
Brand3      1     0        1         1
Select all columns except the first with DataFrame.iloc, count the values with value_counts, replace unmatched missing values with fillna, convert to integers, transpose, and finally reorder the columns with reindex:
cols = ['VeryBad','Bad','Good','VeryGood']
df = df.iloc[:, 1:].apply(pd.value_counts).fillna(0).astype(int).T.reindex(cols, axis=1)
print (df)
        VeryBad  Bad  Good  VeryGood
Brand1        1    1     1         0
Brand2        0    0     3         0
Brand3        1    1     0         1
Here is an approach using melt and pivot_table:
(df.melt(id_vars='Id')
   .pivot_table(index='variable',
                columns='value',
                aggfunc='count',
                fill_value=0))
[out]
          Id
value    Bad  Good  VeryBad  VeryGood
variable
Brand1     1     1        1         0
Brand2     0     3        0         0
Brand3     1     0        1         1
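Note the leftover Id level at the top of the columns; it can be removed with droplevel. A sketch (not part of the original answer), assuming the same sample frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                   'Brand1': [np.nan, 'Bad', np.nan, 'Good', np.nan, 'VeryBad'],
                   'Brand2': ['Good', np.nan, np.nan, np.nan, 'Good', 'Good'],
                   'Brand3': ['Bad', np.nan, 'VeryBad', np.nan, 'VeryGood', np.nan]})

out = (df.melt(id_vars='Id')
         .pivot_table(index='variable', columns='value',
                      aggfunc='count', fill_value=0)
         .droplevel(0, axis=1))  # drop the leftover 'Id' column level
print(out)
```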
Another way: get_dummies on the transpose + groupby() + sum()
m=pd.get_dummies(df.set_index('Id').T)
final=m.groupby(m.columns.str.split('_').str[1],axis=1).sum()
        Bad  Good  VeryBad  VeryGood
Brand1    1     1        1         0
Brand2    0     3        0         0
Brand3    1     0        1         1
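One caveat: groupby(..., axis=1) is deprecated in recent pandas releases. Grouping on the transpose gives the same table without it. A sketch, assuming the sample frame from the question:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                   'Brand1': [np.nan, 'Bad', np.nan, 'Good', np.nan, 'VeryBad'],
                   'Brand2': ['Good', np.nan, np.nan, np.nan, 'Good', 'Good'],
                   'Brand3': ['Bad', np.nan, 'VeryBad', np.nan, 'VeryGood', np.nan]})

m = pd.get_dummies(df.set_index('Id').T)
# group the transposed rows by the rating suffix, then transpose back
final = m.T.groupby(m.columns.str.split('_').str[1]).sum().T
print(final)
```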

How to group rows so as to use value_counts on the created groups with pandas?

I have some customer data such as this in a data frame:
S No  Country  Sex
1     Spain    M
2     Norway   F
3     Mexico   M
...
...
I want to have an output such as this:
Spain
M = 1207
F = 230
Norway
M = 33
F = 102
...
I have a basic notion that I want to group my rows by country with something like df.groupby(df.Country), and then run something like df.Sex.value_counts() on each group.
Thanks!
I think you need crosstab:
df = pd.crosstab(df.Sex, df.Country)
Or, if you want to build on your own solution, add unstack to move the first level of the MultiIndex into the columns:
df = df.groupby(df.Country).Sex.value_counts().unstack(level=0, fill_value=0)
print (df)
Country  Mexico  Norway  Spain
Sex
F             0       1      0
M             1       0      1
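A runnable sketch of the crosstab approach on a tiny version of the sample (column names assumed from the question):

```python
import pandas as pd

df = pd.DataFrame({'No': [1, 2, 3],
                   'Country': ['Spain', 'Norway', 'Mexico'],
                   'Sex': ['M', 'F', 'M']})

# rows = Sex, columns = Country, cells = counts
res = pd.crosstab(df.Sex, df.Country)
print(res)
```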
EDIT:
If you want to add more columns, you can choose which level of the index is converted to columns via the level parameter:
df1 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=0, fill_value=0).reset_index()
print (df1)
No Country Sex  1  2  3
0   Mexico   M  0  0  1
1   Norway   F  0  1  0
2    Spain   M  1  0  0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=1, fill_value=0).reset_index()
print (df2)
Country  No Sex  Mexico  Norway  Spain
0         1   M       0       0      1
1         2   F       0       1      0
2         3   M       1       0      0
df2 = df.groupby([df.No, df.Country]).Sex.value_counts().unstack(level=2, fill_value=0).reset_index()
print (df2)
Sex  No Country  F  M
0     1   Spain  0  1
1     2  Norway  1  0
2     3  Mexico  0  1
You can also use pandas.pivot_table:
res = df.pivot_table(index='Country', columns='Sex', aggfunc='count', fill_value=0)
print(res)
        SNo
Sex       F  M
Country
Mexico    0  1
Norway    1  0
Spain     0  1

Pandas Dataframe fillna() using other known column values

Given the following sample df:
   Other1  Other2     Name  Value
0       0       1  Johnson      C
1       0       0  Johnson    NaN
2       1       1    Smith      R
3       1       1    Smith    NaN
4       0       1  Jackson      X
5       1       1  Jackson    NaN
6       1       1  Jackson    NaN
I want to be able to fill the NaN values with the df['Value'] value associated with the given name in that row. My desired outcome is the following, which I know can be achieved like so:
df['Value'] = df['Value'].fillna(method='ffill')
   Other1  Other2     Name  Value
0       0       1  Johnson      C
1       0       0  Johnson      C
2       1       1    Smith      R
3       1       1    Smith      R
4       0       1  Jackson      X
5       1       1  Jackson      X
6       1       1  Jackson      X
However, this solution will not achieve the desired result if rows with the same name do not follow one another. I also cannot sort by df['Name'], as the order is important. Is there an efficient way to fill a given NaN value with the value associated with its name?
It's also important to note that a given Name will always have exactly one value associated with it. Thank you in advance.
You should use groupby and transform:
df['Value'] = df.groupby('Name')['Value'].transform('first')
df
   Other1  Other2     Name  Value
0       0       1  Johnson      C
1       0       0  Johnson      C
2       1       1    Smith      R
3       1       1    Smith      R
4       0       1  Jackson      X
5       1       1  Jackson      X
6       1       1  Jackson      X
Peter's answer is not correct because the first valid value may not always be the first in the group, in which case ffill will pollute the next group with the previous group's value.
ALollz's answer is fine, but dropna incurs some degree of overhead.
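For completeness, a lookup-style sketch of that idea, assuming each Name carries exactly one non-null Value (as the question guarantees):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'Name': ['Johnson', 'Johnson', 'Smith', 'Smith'],
                   'Value': ['C', np.nan, 'R', np.nan]})

# build a Name -> Value lookup from rows that have a value, then map it back
lookup = df.dropna(subset=['Value']).set_index('Name')['Value']
df['Value'] = df['Name'].map(lookup)
print(df)
```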

How to do keyword mapping in pandas

I have these keywords:
India
Japan
United States
Germany
China
Here's a sample dataframe:
id Address
1 Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan
2 Arcisstraße 21, 80333 München, Germany
3 Liberty Street, Manhattan, New York, United States
4 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
5 Vaishnavi Summit,80feet Road,3rd Block,Bangalore, Karnataka, India
My goal is to make:
id  Address                                                     India  Japan  United States  Germany  China
1   Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan              0      1              0        0      0
2   Arcisstraße 21, 80333 München, Germany                          0      0              0        1      0
3   Liberty Street, Manhattan, New York, USA                        0      0              1        0      0
4   30 Shuangqing Rd, Haidian Qu, Beijing Shi, China                0      0              0        0      1
5   Vaishnavi Summit,80feet Road,Bangalore, Karnataka, India        1      0              0        0      0
The basic idea is to create a keyword detector. I was thinking of using str.contains or word2vec, but I can't work out the logic.
Make use of pd.get_dummies():
countries = df.Address.str.extract('(India|Japan|United States|Germany|China)', expand = False)
dummies = pd.get_dummies(countries)
pd.concat([df,dummies],axis = 1)
Also, the most straightforward way is to have the countries in a list and use a for loop, say
countries = ['India','Japan','United States','Germany','China']
for c in countries:
    df[c] = df.Address.str.contains(c) * 1
but it can be slow if you have a lot of data and countries.
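A runnable sketch of that loop on two of the sample rows:

```python
import pandas as pd

df = pd.DataFrame({'Address': ['Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, Japan',
                               'Arcisstraße 21, 80333 München, Germany']})

countries = ['India', 'Japan', 'United States', 'Germany', 'China']
for c in countries:
    # str.contains gives a boolean Series; multiplying by 1 casts it to 0/1
    df[c] = df['Address'].str.contains(c) * 1
print(df)
```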
In [58]: df = df.join(df.Address.str.extract(r'.*,(.*)', expand=False).str.get_dummies())
In [59]: df
Out[59]:
   id                                            Address  China  Germany  India  Japan  United States
0   1  Chome-2-8 Shibakoen, Minato, Tokyo 105-0011, J...      0        0      0      1              0
1   2             Arcisstraße 21, 80333 München, Germany      0        1      0      0              0
2   3  Liberty Street, Manhattan, New York, United St...      0        0      0      0              1
3   4   30 Shuangqing Rd, Haidian Qu, Beijing Shi, China      1        0      0      0              0
4   5  Vaishnavi Summit,80feet Road,3rd Block,Bangalo...      0        0      1      0              0
NOTE: this method will not work if the country is not in the last position of the Address column, or if the country name contains a comma.
from numpy.core.defchararray import find
kw = 'India|Japan|United States|Germany|China'.split('|')
a = df.Address.values.astype(str)[:, None]
df.join(
    pd.DataFrame(
        find(a, kw) >= 0,
        df.index, kw,
        dtype=int
    )
)
   id                        Address  India  Japan  United States  Germany  China
0   1  Chome-2-8 Shibakoen, Minat...      0      1              0        0      0
1   2  Arcisstraße 21, 80333 Münc...      0      0              0        1      0
2   3  Liberty Street, Manhattan,...      0      0              1        0      0
3   4  30 Shuangqing Rd, Haidian ...      0      0              0        0      1
4   5  Vaishnavi Summit,80feet Ro...      1      0              0        0      0
