I have a pandas DataFrame with 3 columns:
Brand Model car_age
PEUGEOT 207 4. 6-8
BMW 3ER REIHE 2. 1-2
FIAT FIAT DOBLO 3. 3-5
PEUGEOT 207 1. 0
BMW 3ER REIHE 2. 1-2
PEUGEOT 308 2. 1-2
BMW 520D 2. 1-2
... ... ...
And I want to group by Brand and Model and calculate the count per car_age category:
Brand Model "1. 0" "2. 1-2" "3. 3-5" "4. 6-8"
PEUGEOT 207 1 0 0 1
PEUGEOT 308 0 1 0 0
BMW 3ER REIHE 0 2 0 0
BMW 520D 0 1 0 0
FIAT FIAT DOBLO 0 0 1 0
PS: "1. 0" means category one, which corresponds to a car age of zero; "2. 1-2" means category two, which corresponds to car ages between 1 and 2. I number my categories so they appear in the correct order.
I tried this:
output_count = pd.DataFrame({'Count':df.groupby('Brand','Model','car_age').size()})
but it raised an error:
ValueError: No axis named Model for object type <class 'pandas.core.frame.DataFrame'>
Could anyone help me with this issue?
I think I provided enough information, but let me know if I can provide more.
Use pd.crosstab:
pd.crosstab([df['Brand'], df['Model']], df['car_age']).reset_index()
Output:
car_age Brand Model 1. 0 2. 1-2 3. 3-5 4. 6-8
0 BMW 3ER REIHE 0 2 0 0
1 BMW 520D 0 1 0 0
2 FIAT FIAT DOBLO 0 0 1 0
3 PEUGEOT 207 1 0 0 1
4 PEUGEOT 308 0 1 0 0
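A side note on ordering: your numeric prefixes ("1. 0", "2. 1-2", ...) already make the columns sort correctly, but if you ever drop the prefixes you can pin the order with an ordered Categorical. A minimal sketch, assuming the df from the question; rename_axis just removes the leftover "car_age" columns name:

import pandas as pd

# explicit category order instead of relying on the label prefixes
order = ['1. 0', '2. 1-2', '3. 3-5', '4. 6-8']
df['car_age'] = pd.Categorical(df['car_age'], categories=order, ordered=True)

out = (pd.crosstab([df['Brand'], df['Model']], df['car_age'])
         .reset_index()
         .rename_axis(None, axis=1))  # drop the 'car_age' columns name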
The correct way to group a DataFrame by multiple columns is to pass a list of the column names:
df.groupby(['Brand','Model','car_age'])
I hope this helps you solve your problem.
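If you prefer to stay with groupby, a minimal sketch of getting from the grouped counts to the wide table in the question: unstack the car_age level, with fill_value=0 turning missing Brand/Model/category combinations into zero counts:

counts = (df.groupby(['Brand', 'Model', 'car_age'])
            .size()                              # count per (Brand, Model, car_age)
            .unstack('car_age', fill_value=0)    # one column per car_age category
            .reset_index())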
Here is a function you can call if you want to see how it works step by step:
import pandas as pd

def group_by(df):
    # one 0/1 indicator column per car_age category
    data_dumm = pd.get_dummies(df['car_age'])
    data = df.drop(columns='car_age')
    # sum the indicators per Brand/Model group to get the counts
    X = pd.concat([data, data_dumm], axis=1).groupby(['Brand', 'Model']).sum()
    return X.reset_index()

group_by(df)
Output:
Brand Model 1. 0 2. 1-2 3. 3-5 4. 6-8
0 BMW 3ER REIHE 0 2 0 0
1 BMW 520D 0 1 0 0
2 FIAT FIAT DOBLO 0 0 1 0
3 PEUGEOT 207 1 0 0 1
4 PEUGEOT 308 0 1 0 0
This question is an extension of this question. Consider the pandas DataFrame visualized in the table below.
   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0  a           volvo  p       swe      1      0        1        23   set    set
1  b           volvo  None    swe      0      0        1        45   set    set
2  c           bmw    p       us       0      0        1        56   test   test
3  d           bmw    p       us       0      1        1        43   test   test
4  e           bmw    d       germany  1      0        1        34   set    set
5  f           audi   d       germany  1      0        1        59   set    set
6  g           volvo  d       swe      1      0        0        65   test   set
7  h           audi   d       swe      1      0        0        78   test   set
8  i           volvo  d       us       1      1        1        32   set    set
To convert a column with string entries, one can define a mapping and then call pandas.DataFrame.replace().
For example:
mapping = {'set': 1, 'test': 2}
df.replace({'set': mapping, 'tesst': mapping})
This would lead to the following DataFrame (table):
   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0  a           volvo  p       swe      1      0        1        23   1      1
1  b           volvo  None    swe      0      0        1        45   1      1
2  c           bmw    p       us       0      0        1        56   2      2
3  d           bmw    p       us       0      1        1        43   2      2
4  e           bmw    d       germany  1      0        1        34   1      1
5  f           audi   d       germany  1      0        1        59   1      1
6  g           volvo  d       swe      1      0        0        65   2      1
7  h           audi   d       swe      1      0        0        78   2      1
8  i           volvo  d       us       1      1        1        32   1      1
As seen above, the last two columns' strings are replaced with numbers representing those strings.
The question is then: is there a faster, less hands-on approach to replace all the strings with numbers? Can one automatically create a mapping (and output it somewhere for human reference)?
Something that makes the DataFrame end up like:
   respondent  brand  engine  country  aware  aware_2  aware_3  age  tesst  set
0  1           1      1       1        1      0        1        23   1      1
1  2           1      2       1        0      0        1        45   1      1
2  3           2      1       2        0      0        1        56   2      2
3  4           2      1       2        0      1        1        43   2      2
4  5           2      3       3        1      0        1        34   1      1
5  6           3      3       3        1      0        1        59   1      1
6  7           1      3       1        1      0        0        65   2      1
7  8           3      3       1        1      0        0        78   2      1
8  9           1      3       2        1      1        1        32   1      1
Also output:
[{'volvo': 1, 'bmw': 2, 'audi': 3}, {'p': 1, 'None': 2, 'd': 3}, {'swe': 1, 'us': 2, 'germany': 3}]
Note that the output list of maps (dicts) should not be hard-coded but instead produced by the code.
You can adapt the code given in this answer https://stackoverflow.com/a/39989896/15320403 (inside the post you linked) to generate a mapping for each column of your choice and apply replace as you suggested:
all_brands = df.brand.unique()
brand_dic = dict(zip(all_brands, range(len(all_brands))))
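To cover every string column at once, a minimal sketch, assuming each string column from your example gets its own mapping (codes start at 1 here, to match the desired output above; it also assumes the entries are actual strings, e.g. the literal 'None' rather than NaN):

mappings = {}
for col in ['brand', 'engine', 'country', 'tesst', 'set']:  # the string columns
    values = df[col].unique()  # order of first appearance
    mappings[col] = dict(zip(values, range(1, len(values) + 1)))

df = df.replace(mappings)  # nested dict: {column: {old: new}}
print(mappings)            # keep this for human reference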
You will need to first change the type of the column to Categorical and then create a new column (or overwrite the existing one) with the codes:
df['brand'] = pd.Categorical(df['brand'])
df['brand_codes'] = df['brand'].cat.codes
If you need the mapping:
dict(enumerate(df['brand'].cat.categories))  # this works only after you've converted the column to categorical
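For the example data this yields the following (note that the codes start at 0 and the categories are sorted alphabetically, so they will not match the hand-numbered mapping shown earlier):

print(dict(enumerate(df['brand'].cat.categories)))
# {0: 'audi', 1: 'bmw', 2: 'volvo'}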
Building on the other answers, I've written this function to solve the problem:
import pandas as pd

def convertStringColumnsToNum(data):
    columns = data.columns
    columns_dtypes = data.dtypes
    maps = []
    for col_idx in range(len(columns)):
        # don't change columns that already consist of numbers
        if columns_dtypes[col_idx] == 'int64':  # can be extended to more dtypes
            continue
        # inspired by Shivam Roy's answer
        col = columns[col_idx]
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        maps.append(tmp.categories)
    return maps
This function returns the maps used to replace strings with a numeric code; the code is the index at which a string resides in that list. This function works, yet it comes with a SettingWithCopyWarning.
If it ain't broke, don't fix it, right? ;)
But if anyone has a way to adapt this function so that the warning is no longer shown, feel free to comment on it. It works anyway *shrugs*
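One hedged way to get rid of the warning: it usually means the function received a slice of another DataFrame, so taking an explicit copy before assigning avoids it. A sketch of that variant (note it returns the converted frame as well, since it no longer mutates the argument in place):

def convertStringColumnsToNum(data):
    data = data.copy()  # work on an explicit copy, not a potential view
    maps = []
    for col in data.select_dtypes(include='object').columns:  # string columns only
        tmp = pd.Categorical(data[col])
        data[col] = tmp.codes
        maps.append(tmp.categories)
    return data, maps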
I have a column in my dataframe that contains many different companies separated by commas (assume there are additional rows with even more companies).
company
apple,microsoft,disney,nike
microsoft,adidas,amazon,eBay
I want to convert this to binary columns for every possible company that appears. It should ultimately look like this:
adidas apple amazon eBay disney microsoft nike ... last_store
0 1 0 0 1 1 1 ... 0
1 0 1 1 0 1 0 ... 0
Let us try str.get_dummies:
s = df.company.str.get_dummies(',')
adidas amazon apple disney eBay microsoft nike
0 0 0 1 1 0 1 1
1 1 1 0 0 1 1 0
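If you want the indicator columns attached to the original frame rather than kept in a separate object, one possible follow-up:

df = df.drop(columns='company').join(s)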
I have data like this:
Users_id My_Fav Bro_Fav Friend_Fav
User0 BMW VW BMW
UserA VW Mercedes Honda
UserB Honda Honda VW
UserC Mercedes BMW Mercedes
UserD VW BMW BMW
I would like to count how often each brand appears in each column; the desired output would look like this:
My_Fav Bro_Fav Friend_Fav
BMW 1 2 2
VW 2 1 1
Honda 1 1 1
Mercedes 1 1 1
You can count the values per column and then sum per index value; if necessary, convert the Users_id column to the index in a first step:
#Users_id is a column (sum(level=0) is removed in recent pandas; groupby(level=0).sum() is equivalent)
df = df.set_index('Users_id').apply(pd.value_counts).groupby(level=0).sum()
#Users_id is already the index
#df = df.apply(pd.value_counts).groupby(level=0).sum()
print (df)
My_Fav Bro_Fav Friend_Fav
BMW 1 2 2
Honda 1 1 1
Mercedes 1 1 1
VW 2 1 1
IIUC, melt + crosstab:
s = df.melt('Users_id')
s = pd.crosstab(s.value, s.variable)
variable Bro_Fav Friend_Fav My_Fav
value
BMW 2 2 1
Honda 1 1 1
Mercedes 1 1 1
VW 1 1 2
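crosstab sorts both axes alphabetically; if you want the column order from the question, you can reindex the result:

s = s.reindex(columns=['My_Fav', 'Bro_Fav', 'Friend_Fav'])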
I want to create one row for each unit of Volume, for each make and company.
Not sure if there is a way; my googling skills did not find any solutions...
I have tried some for loops, but without success.
Here is a table as an example.
0 Ford 2000 CompanyX
1 Volvo 3000 CompanyX
2 Mazda 2400 CompanyX
3 Fiat 1000 CompanyX
4 Ford 2000 CompanyY
5 Volvo 3000 CompanyY
6 Mazda 2400 CompanyY
7 Fiat 1000 CompanyY
The end result should then be 16800 rows, where each row looks like e.g.
0 Ford 1 CompanyX
1 Ford 1 CompanyX
etc.
You can use the "apply" method with "lambda":
Solution 1:
Run over the rows one by one; for each row, pass the volume to a helper function that returns a list of ones.
def check(volume):
    return [1 for i in range(volume)]

df1['Volume'] = df1.apply(lambda row: check(row['Volume']), axis=1)
df1 = df1.explode('Volume')
df1
Solution 2:
Apply only to the "Volume" column: for each value, build the list of ones directly.
df1['Volume'] = df1['Volume'].apply(lambda col: [1 for i in range(col)])
df1 = df1.explode('Volume')
df1
Result:
User ID Make Volume Company
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
0 1 Ford 1 CompanyX
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
1 2 Volvo 1 CompanyY
I also compared the performance of the 3 methods; the 3rd method is rafaelc's answer below.
IIUC, you may use loc + df.index.repeat, and then just set your Volume to 1.
df = df.loc[df.index.repeat(df['Volume'])]
df['Volume'] = 1
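A minimal end-to-end sketch of this approach, reconstructing the example table from the question:

import pandas as pd

df = pd.DataFrame({
    'Make': ['Ford', 'Volvo', 'Mazda', 'Fiat'] * 2,
    'Volume': [2000, 3000, 2400, 1000] * 2,
    'Company': ['CompanyX'] * 4 + ['CompanyY'] * 4,
})

# repeat each row Volume times, then mark each repeated row as one unit
df = df.loc[df.index.repeat(df['Volume'])].reset_index(drop=True)
df['Volume'] = 1
print(len(df))  # 16800 rows, one per unit of volume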
I currently have a column called Country that can have a value of USA, Canada, Japan. For example:
Country
-------
Japan
Japan
USA
....
Canada
I want to split ("extract") the values into three individual columns (Country_USA, Country_Canada, and Country_Japan), and basically, a column will have a value of 1 if it matches the original value from the Country column. For example:
Country --> Country_Japan Country_USA Country_Canada
------- ------------- ----------- ---------------
Japan 1 0 0
USA 0 1 0
Japan 1 0 0
....
Is there a simple (non-tedious) way to do this using pandas / Python 3.x? Thanks!
Use join with get_dummies and add_prefix:
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Demo:
df=pd.DataFrame({'Country':['Japan','USA','Japan','Canada']})
print(df.join(df['Country'].str.get_dummies().add_prefix('Country_')))
Output:
Country Country_Canada Country_Japan Country_USA
0 Japan 0 1 0
1 USA 0 0 1
2 Japan 0 1 0
3 Canada 1 0 0
Better version, thanks to Scott:
print(df.join(pd.get_dummies(df)))
Output: identical to the above.
Another good version from Scott:
print(df.assign(**pd.get_dummies(df)))
Output: again identical to the above.