I have a pandas DataFrame with 3 columns:
Brand Model car_age
PEUGEOT 207 4. 6-8
BMW 3ER REIHE 2. 1-2
FIAT FIAT DOBLO 3. 3-5
PEUGEOT 207 1. 0
BMW 3ER REIHE 2. 1-2
PEUGEOT 308 2. 1-2
BMW 520D 2. 1-2
... ... ...
And I want to group by Brand and Model and calculate the count per car_age category:
Brand Model "1. 0" "2. 1-2" "3. 3-5" "4. 6-8"
PEUGEOT 207 1 0 0 1
PEUGEOT 308 0 1 0 0
BMW 3ER REIHE 0 2 0 0
BMW 520D 0 1 0 0
FIAT FIAT DOBLO 0 0 1 0
PS: "1. 0" means category one, which corresponds to a car age of zero; "2. 1-2" means category two, which corresponds to car ages between 1 and 2. I number my categories so they appear in the correct order.
I tried this:
output_count = pd.DataFrame({'Count':df.groupby('Brand','Model','car_age').size()})
but it raised an error:
ValueError: No axis named Model for object type <class 'pandas.core.frame.DataFrame'>
Could anyone help me with this issue?
I think I provided enough information, but let me know if I can provide more.
Use pd.crosstab:
pd.crosstab([df['Brand'], df['Model']], df['car_age']).reset_index()
Output:
car_age Brand Model 1. 0 2. 1-2 3. 3-5 4. 6-8
0 BMW 3ER REIHE 0 2 0 0
1 BMW 520D 0 1 0 0
2 FIAT FIAT DOBLO 0 0 1 0
3 PEUGEOT 207 1 0 0 1
4 PEUGEOT 308 0 1 0 0
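A side note on ordering: your numeric prefixes ("1. 0", "2. 1-2", ...) already make the columns sort correctly, but if you ever drop the prefixes you can pin the order with an ordered Categorical. A minimal sketch, assuming the df from the question; rename_axis just removes the leftover "car_age" columns name:

import pandas as pd

# explicit category order instead of relying on the label prefixes
order = ['1. 0', '2. 1-2', '3. 3-5', '4. 6-8']
df['car_age'] = pd.Categorical(df['car_age'], categories=order, ordered=True)

out = (pd.crosstab([df['Brand'], df['Model']], df['car_age'])
         .reset_index()
         .rename_axis(None, axis=1))  # drop the 'car_age' columns name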
The correct way to group a DataFrame by multiple columns is to pass a list of the column names:
df.groupby(['Brand','Model','car_age'])
I hope this helps you solve your problem.
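If you prefer to stay with groupby, a minimal sketch of getting from the grouped counts to the wide table in the question: unstack the car_age level, with fill_value=0 turning missing Brand/Model/category combinations into zero counts:

counts = (df.groupby(['Brand', 'Model', 'car_age'])
            .size()                              # count per (Brand, Model, car_age)
            .unstack('car_age', fill_value=0)    # one column per car_age category
            .reset_index())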
Here is a function you can call if you want to see how it works step by step:
import pandas as pd

def group_by(df):
    # one 0/1 indicator column per car_age category
    data_dumm = pd.get_dummies(df['car_age'])
    data = df.drop(columns='car_age')
    # sum the indicators per Brand/Model group to get the counts
    X = pd.concat([data, data_dumm], axis=1).groupby(['Brand', 'Model']).sum()
    return X.reset_index()

group_by(df)
Output:
Brand Model 1. 0 2. 1-2 3. 3-5 4. 6-8
0 BMW 3ER REIHE 0 2 0 0
1 BMW 520D 0 1 0 0
2 FIAT FIAT DOBLO 0 0 1 0
3 PEUGEOT 207 1 0 0 1
4 PEUGEOT 308 0 1 0 0
This question is an extension of this question. Consider the pandas DataFrame visualized in the table below.
   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0  a           volvo  p       swe      1      0        1        23   set    set
1  b           volvo  None    swe      0      0        1        45   set    set
2  c           bmw    p       us       0      0        1        56   test   test
3  d           bmw    p       us       0      1        1        43   test   test
4  e           bmw    d       germany  1      0        1        34   set    set
5  f           audi   d       germany  1      0        1        59   set    set
6  g           volvo  d       swe      1      0        0        65   test   set
7  h           audi   d       swe      1      0        0        78   test   set
8  i           volvo  d       us       1      1        1        32   set    set
To convert a column with string entries, one can define a mapping and then call pandas.DataFrame.replace().
For example:
mapping = {'set': 1, 'test': 2}
df.replace({'set': mapping, 'tesst': mapping})
This would lead to the following DataFrame (table):
   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0  a           volvo  p       swe      1      0        1        23   1      1
1  b           volvo  None    swe      0      0        1        45   1      1
2  c           bmw    p       us       0      0        1        56   2      2
3  d           bmw    p       us       0      1        1        43   2      2
4  e           bmw    d       germany  1      0        1        34   1      1
5  f           audi   d       germany  1      0        1        59   1      1
6  g           volvo  d       swe      1      0        0        65   2      1
7  h           audi   d       swe      1      0        0        78   2      1
8  i           volvo  d       us       1      1        1        32   1      1
As seen above, the last two columns' strings are replaced with numbers representing those strings.
The question is then: is there a faster, less hands-on approach to replace all the strings with numbers? Can one automatically create a mapping (and output it somewhere for human reference)?
Something that makes the DataFrame end up like:
   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0  1           1      1       1        1      0        1        23   1      1
1  2           1      2       1        0      0        1        45   1      1
2  3           2      1       2        0      0        1        56   2      2
3  4           2      1       2        0      1        1        43   2      2
4  5           2      3       3        1      0        1        34   1      1
5  6           3      3       3        1      0        1        59   1      1
6  7           1      3       1        1      0        0        65   2      1
7  8           3      3       1        1      0        0        78   2      1
8  9           1      3       2        1      1        1        32   1      1
Also output:
[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]
Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.
You can adapt the code given in this answer https://stackoverflow.com/a/39989896/15320403 (inside the post you linked) to generate a mapping for each column of your choice and apply replace as you suggested:
all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
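To cover every string column at once, a minimal sketch, assuming each string column from your example gets its own mapping (codes start at 1 here, to match the desired output above; it also assumes the entries are actual strings, e.g. the literal 'None' rather than NaN):

mappings = {}
for col in ['brand', 'engine', 'country', 'tesst', 'set']:  # the string columns
    values = df[col].unique()  # order of first appearance
    mappings[col] = dict(zip(values, range(1, len(values) + 1)))

df = df.replace(mappings)  # nested dict: {column: {old: new}}
print(mappings)            # keep this for human reference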
You will need to first change the type of the column to Categorical and then create a new column (or overwrite the existing one) with the codes:
df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes
If you need the mapping:
dict(enumerate(df['brand'].cat.categories))  # this works only after you've converted the column to categorical
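For the example data this yields the following (note that the codes start at 0 and the categories are sorted alphabetically, so they will not match the hand-numbered mapping shown earlier):

print(dict(enumerate(df['brand'].cat.categories)))
# {0: 'audi', 1: 'bmw', 2: 'volvo'}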
Building on the other answers, I've written this function to solve the problem:
import pandas as pd

def convertStringColumnsToNum(data):
    columns = data.columns
    columns_dtypes = data.dtypes
    maps = []
    for col_idx in range(len(columns)):
        # don't change columns that already consist of numbers
        if columns_dtypes[col_idx] == 'int64':  # can be extended to more dtypes
            continue
        # inspired by Shivam Roy's answer
        col = columns[col_idx]
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        maps.append(tmp.categories)
    return maps
This function returns the maps used to replace strings with a numeric code; the code is the index at which a string resides in that list. This function works, yet it comes with a SettingWithCopyWarning.
If it ain't broke, don't fix it, right? ;)
But if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment on it. It works anyway *shrugs*
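One hedged way to get rid of the warning: it usually means the function received a slice of another DataFrame, so taking an explicit copy before assigning avoids it. A sketch of that variant (note it returns the converted frame as well, since it no longer mutates the argument in place):

def convertStringColumnsToNum(data):
    data = data.copy()  # work on an explicit copy, not a potential view
    maps = []
    for col in data.select_dtypes(include='object').columns:  # string columns only
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        maps.append(tmp.categories)
    return data, maps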
I have a column in my dataframe that contains many different companies separated by commas (assume there are additional rows with even more companies).
company
apple,microsoft,disney,nike
microsoft,adidas,amazon,eBay
I want to convert this to binary columns for every possible company that appears. It should ultimately look like this:
adidas apple amazon eBay disney microsoft nike ... last_store
0 1 0 0 1 1 1 ... 0
1 0 1 1 0 1 0 ... 0
Let us try str.get_dummies:
s = df.company.str.get_dummies(',')
adidas amazon apple disney eBay microsoft nike
0 0 0 1 1 0 1 1
1 1 1 0 0 1 1 0
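If you want the indicator columns attached to the original frame rather than kept in a separate object, one possible follow-up:

df = df.drop(columns='company').join(s)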
I have data like this:
Users_id My_Fav Bro_Fav Friend_Fav
User0 BMW VW BMW
UserA VW Mercedes Honda
UserB Honda Honda VW
UserC Mercedes BMW Mercedes
UserD VW BMW BMW
I would like to count how often each brand appears in each column; the desired output would look like this:
My_Fav Bro_Fav Friend_Fav
BMW 1 2 2
VW 2 1 1
Honda 1 1 1
Mercedes 1 1 1
You can count the values per column and then sum per index value; if necessary, convert the Users_id column to the index in a first step:
#Users_id is a column (sum(level=0) is removed in recent pandas; groupby(level=0).sum() is equivalent)
df = df.set_index('Users_id').apply(pd.value_counts).groupby(level=0).sum()
#Users_id is already the index
#df = df.apply(pd.value_counts).groupby(level=0).sum()
print (df)
My_Fav Bro_Fav Friend_Fav
BMW 1 2 2
Honda 1 1 1
Mercedes 1 1 1
VW 2 1 1
IIUC, melt + crosstab:
s = df.melt('Users_id')
s = pd.crosstab(s.value, s.variable)
variable Bro_Fav Friend_Fav My_Fav
value
BMW 2 2 1
Honda 1 1 1
Mercedes 1 1 1
VW 1 1 2
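crosstab sorts both axes alphabetically; if you want the column order from the question, you can reindex the result:

s = s.reindex(columns=['My_Fav', 'Bro_Fav', 'Friend_Fav'])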
I want to create one row for each unit of Volume, for each make and company.
Not sure if there is a way; my googling skills did not find any solutions...
I have tried some for loops, but without success.
Here is a table as an example.
0 Ford 2000 CompanyX
1 Volvo 3000 CompanyX
2 Mazda 2400 CompanyX
3 Fiat 1000 CompanyX
4 Ford 2000 CompanyY
5 Volvo 3000 CompanyY
6 Mazda 2400 CompanyY
7 Fiat 1000 CompanyY
The end result should then be 16800 rows, where each row looks like e.g.
0 Ford 1 CompanyX
1 Ford 1 CompanyX
etc.
You can use the "apply" method with "lambda":
Solution 1:
Run over the rows one by one; for each row, pass the volume to a helper function that returns a list of ones.
def check(volume):
    return [1 for i in range(volume)]

df1['Volume'] = df1.apply(lambda row: check(row['Volume']), axis=1)
df1 = df1.explode('Volume')
df1
Solution 2:
Apply only to the "Volume" column: for each value, build the list of ones directly.
df1['Volume'] = df1['Volume'].apply(lambda col: [1 for i in range(col)])
df1 = df1.explode('Volume')
df1
Result:
User ID Make Volume Company
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
I also compared the performance of the 3 methods; the 3rd method is rafaelc's answer below.
IIUC, you may use loc + df.index.repeat, and then just set your Volume to 1.
df = df.loc[df.index.repeat(df['Volume'])]
df['Volume'] = 1
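A minimal end-to-end sketch of this approach, reconstructing the example table from the question:

import pandas as pd

df = pd.DataFrame({
    'Make': ['Ford', 'Volvo', 'Mazda', 'Fiat'] * 2,
    'Volume': [2000, 3000, 2400, 1000] * 2,
    'Company': ['CompanyX'] * 4 + ['CompanyY'] * 4,
})

# repeat each row Volume times, then mark each repeated row as one unit
df = df.loc[df.index.repeat(df['Volume'])].reset_index(drop=True)
df['Volume'] = 1
print(len(df))  # 16800 rows, one per unit of volume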
I currently have a column called Country that can have a value of USA, Canada, Japan. For example:
Country
-------
Japan
Japan
USA
....
Canada
I want to split ("extract") the values into three individual columns (Country_USA, Country_Canada, and Country_Japan), and basically, a column will have a value of 1 if it matches the original value from the Country column. For example:
Country --> Country_Japan Country_USA Country_Canada
------- ------------- ----------- ---------------
Japan 1 0 0
USA 0 1 0
Japan 1 0 0
....
Is there a simple (non-tedious) way to do this using pandas / Python 3.x? Thanks!
Use join with get_dummies and add_prefix:
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Demo:
df=pd.DataFrame({'Country':['Japan','USA','Japan','Canada']})
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
Better version, thanks to Scott:
print(df.join(pd.get_dummies(df)))
Output: identical to the above.
Another good version from Scott:
print(df.assign(**pd.get_dummies(df)))
Output: again identical to the above.