Python pandas: dynamic concatenation from get_dummies

Having the following dataframe:
import pandas as pd
cars = ["BMV", "Mercedes", "Audi"]
customer = ["Juan", "Pepe", "Luis"]
price = [100, 200, 300]
year = [2022, 2021, 2020]
df_raw = pd.DataFrame(list(zip(cars, customer, price, year)),
                      columns=["cars", "customer", "price", "year"])
I need to one-hot encode the categorical variables cars and customer; for this I use the get_dummies method on these two columns.
numerical = ["price", "year"]
df_final = pd.concat([df_raw[numerical], pd.get_dummies(df_raw.cars),
                      pd.get_dummies(df_raw.customer)], axis=1)
Is there a way to generate these dummies dynamically, e.g. by putting the columns in a list and looping over them with a for loop? In this case it may seem simple because I only have two, but if I had 30 or 60 attributes, would I have to go one by one?

You can pass the columns to encode directly to pd.get_dummies:
pd.get_dummies(df_raw, columns=['cars', 'customer'])
price year cars_Audi cars_BMV cars_Mercedes customer_Juan customer_Luis customer_Pepe
0 100 2022 0 1 0 1 0 0
1 200 2021 0 0 1 0 0 1
2 300 2020 1 0 0 0 1 0
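If listing 30 or 60 columns by hand is the concern, one sketch is to build the columns list from the dtypes instead, assuming every object-dtype column should be encoded:

```python
import pandas as pd

cars = ["BMV", "Mercedes", "Audi"]
customer = ["Juan", "Pepe", "Luis"]
price = [100, 200, 300]
year = [2022, 2021, 2020]
df_raw = pd.DataFrame(list(zip(cars, customer, price, year)),
                      columns=["cars", "customer", "price", "year"])

# Build the column list dynamically: every object-dtype column gets encoded
categorical = df_raw.select_dtypes(include="object").columns.tolist()
df_final = pd.get_dummies(df_raw, columns=categorical)
print(categorical)  # ['cars', 'customer']
```

With 60 attributes nothing changes: select_dtypes picks them all up, and the numeric columns pass through untouched.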

One simple way is to concatenate the columns and use str.get_dummies:
cols = ['cars', 'customer']
out = df_raw.join(df_raw[cols].agg('|'.join, axis=1).str.get_dummies())
output:
cars customer price year Audi BMV Juan Luis Mercedes Pepe
0 BMV Juan 100 2022 0 1 1 0 0 0
1 Mercedes Pepe 200 2021 0 0 0 0 1 1
2 Audi Luis 300 2020 1 0 0 1 0 0
Another option is to melt and use crosstab:
df2 = df_raw[cols].reset_index().melt('index')
out = df_raw.join(pd.crosstab(df2['index'], df2['value']))
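For completeness, the literal for-loop version the question asks about also works as a sketch: collect one dummy frame per column in a list comprehension and concatenate once at the end:

```python
import pandas as pd

cars = ["BMV", "Mercedes", "Audi"]
customer = ["Juan", "Pepe", "Luis"]
price = [100, 200, 300]
year = [2022, 2021, 2020]
df_raw = pd.DataFrame(list(zip(cars, customer, price, year)),
                      columns=["cars", "customer", "price", "year"])

numerical = ["price", "year"]
cols = ["cars", "customer"]

# Loop over the categorical columns, building one dummy frame per column
dummies = [pd.get_dummies(df_raw[c]) for c in cols]
out = pd.concat([df_raw[numerical]] + dummies, axis=1)
```

This scales to any number of columns by extending cols, though passing columns= to pd.get_dummies (as above) is usually simpler.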

Related

Reassign the multiple column's value from other rows in a pandas dataframe

Got a dataframe, CY (current year)- refers to 2022, PY (Previous Year) - refers to 2021, & PPY (Prior to Previous year) refers to 2020. Want to collect this information as in a single row for a single id. Input dataframe looks like -
id Year Jan_CY Feb_CY Jan_PY Feb_PY Jan_PPY Feb_PPY
1 2022 1 2 0 0 0 0
1 2021 0 0 3 4 0 0
1 2020 0 0 0 0 5 6
2 2022 0 0 0 0 0 0
2 2021 0 0 7 8 0 0
2 2020 0 0 0 0 9 10
Output dataframe looks like
id Year Jan_CY Feb_CY Jan_PY Feb_PY Jan_PPY Feb_PPY
0 1 2022 1 2 3 4 5 6
1 2 2022 0 0 7 8 9 10
Tried with below code:
def get_previous_values(row):
    cols = row.columns
    py_cols = [i for i in cols if i.endswith("_PY")]
    ppy_cols = [j for j in cols if j.endswith("_PPY")]
    row[py_cols].mask((df['clnt_orgn_id'] == clnt_orgn_id) & (df['SMRY_YR_NO'] == 2021), df[py_cols])
    return row
but couldn't solve it.
If the result is just the column sum, then a groupby would do it:
df.groupby("id").sum()
Note: This requires every year to appear exactly one time for each ID.
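A minimal sketch of that groupby approach on the question's data (note that Year has to be dropped first, or it would be summed as well):

```python
import pandas as pd

df = pd.DataFrame({
    "id":      [1, 1, 1, 2, 2, 2],
    "Year":    [2022, 2021, 2020, 2022, 2021, 2020],
    "Jan_CY":  [1, 0, 0, 0, 0, 0],
    "Feb_CY":  [2, 0, 0, 0, 0, 0],
    "Jan_PY":  [0, 3, 0, 0, 7, 0],
    "Feb_PY":  [0, 4, 0, 0, 8, 0],
    "Jan_PPY": [0, 0, 5, 0, 0, 9],
    "Feb_PPY": [0, 0, 6, 0, 0, 10],
})

# Drop Year so it is not summed, collapse to one row per id,
# then reattach the current year as a constant column
out = (df.drop(columns="Year")
         .groupby("id").sum()
         .reset_index())
out.insert(1, "Year", 2022)
```

This reproduces the expected output because each month's value is nonzero in exactly one of the three yearly rows.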
You can use:
curr_year = 2022
years = {curr_year: 'CY', curr_year-1: 'PY', curr_year-2: 'PPY'}
# or directly
# years = {2022: 'CY', 2021: 'PY', 2020: 'PPY'}
out = (df
       .rename(columns=lambda x: x.split('_')[0])
       .pivot(index='id', columns='Year')
       .sort_index(axis=1, level=1, ascending=False)
       .groupby(axis=1, level=[0, 1], sort=False).sum()
       .pipe(lambda d: d.set_axis(d.columns.map(lambda x: f'{x[0]}_{years[x[1]]}'), axis=1))
       .reset_index()
       )
# optionally add the current year
out.insert(1, 'Year', curr_year)
print(out)
Output:
id Year Jan_CY Feb_CY Jan_PY Feb_PY Jan_PPY Feb_PPY
0 1 2022 1 2 3 4 5 6
1 2 2022 0 0 7 8 9 10

How to merge multiple dummy variables columns which were created from a single categorical variable into single column in python?

I am working on the IPL dataset, which has many categorical variables; one such variable is toss_winner. I have created dummy variables for it, and now I have 15 columns with binary values. I want to merge all these columns into a single column with numbers 0-14, each number representing an IPL team.
IIUC, Use:
df['Team No.'] = dummies.cumsum(axis=1).ne(1).sum(axis=1)
Example,
df = pd.DataFrame({'Toss winner': ['Chennai', 'Mumbai', 'Rajasthan', 'Banglore', 'Hyderabad']})
dummies = pd.get_dummies(df['Toss winner'])
df['Team No.'] = dummies.cumsum(axis=1).ne(1).sum(axis=1)
Result:
# print(dummies)
Banglore Chennai Hyderabad Mumbai Rajasthan
0 0 1 0 0 0
1 0 0 0 1 0
2 0 0 0 0 1
3 1 0 0 0 0
4 0 0 1 0 0
# print (df)
Toss winner Team No.
0 Chennai 1
1 Mumbai 3
2 Rajasthan 4
3 Banglore 0
4 Hyderabad 2
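If the goal is simply a 0-14 code per team, note (my addition, not part of the answer above) that the same alphabetical codes can be obtained without building dummies at all, via categorical codes:

```python
import pandas as pd

df = pd.DataFrame({'Toss winner': ['Chennai', 'Mumbai', 'Rajasthan', 'Banglore', 'Hyderabad']})

# Categorical codes are assigned in sorted (alphabetical) order,
# which matches the cumsum/ne trick above
df['Team No.'] = df['Toss winner'].astype('category').cat.codes
```

The cumsum trick and cat.codes agree because get_dummies also orders its columns alphabetically.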

How to clean up columns with values '10-12' (represented in range) in pandas dataframe?

I have a car sales price dataset, where I am trying to predict the sales price given the features of a car. It has a column called 'Fuel Economy city' with values like 10, 12, 10-12, 13-14, ... in a pandas dataframe. I need to convert this to numerical values to apply a regression algorithm. I don't have domain knowledge about automobiles. Please help.
I tried removing the hyphen, but then '10-12' is treated as a four-digit value, which I don't think is correct in this context.
You could try pd.get_dummies() which will make a separate column for the various ranges, marking each column True (1) or False (0). These can then be used in lieu of the ranges (which are considered categorical data.)
import pandas as pd
data = [[10,"blue", "Ford"], [12,"green", "Chevy"],["10-12","white", "Chrysler"],["13-14", "red", "Subaru"]]
df = pd.DataFrame(data, columns = ["Fuel Economy city", "Color", "Make"])
print(df)
df = pd.get_dummies(df)
print(df)
OUTPUT:
Fuel Economy city_10 Fuel Economy city_12 Fuel Economy city_10-12 \
0 1 0 0
1 0 1 0
2 0 0 1
3 0 0 0
Fuel Economy city_13-14 Color_blue Color_green Color_red Color_white \
0 0 1 0 0 0
1 0 0 1 0 0
2 0 0 0 0 1
3 1 0 0 1 0
Make_Chevy Make_Chrysler Make_Ford Make_Subaru
0 0 0 1 0
1 1 0 0 0
2 0 1 0 0
3 0 0 0 1
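Since the end goal is regression, an alternative sketch (my suggestion, not part of the answer above) is to keep the column numeric by replacing each range with its midpoint:

```python
import pandas as pd

df = pd.DataFrame({"Fuel Economy city": ["10", "12", "10-12", "13-14"]})

# "10-12" -> mean of (10, 12) = 11.0; a plain "10" -> 10.0
df["Fuel Economy city"] = (
    df["Fuel Economy city"]
    .astype(str)
    .str.split("-")
    .apply(lambda parts: sum(map(float, parts)) / len(parts))
)
```

Unlike dummies, this preserves the ordering of fuel economies, which a regression model can exploit.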

How to detect change in last 2 months starting from specific row in Pandas DataFrame

Let's say we have a dataframe like this:
Id Policy_id Start_Date End_Date Fee1 Fee2 Last_dup
0 b123 2019/02/24 2019/03/23 0 23 0
1 b123 2019/03/24 2019/04/23 0 23 0
2 b123 2019/04/24 2019/05/23 10 23 1
3 c123 2018/09/01 2019/09/30 10 0 0
4 c123 2018/10/01 2019/10/31 10 0 1
5 d123 2017/02/24 2019/03/23 0 0 0
6 d123 2017/03/24 2019/04/23 0 0 1
The column Last_dup is the result of applying .duplicated (answer).
The result of subtracting Start_Date from End_Date is always 30 days in this case, for simplification. My goal is to detect changes in Fee1 and Fee2 over the last 2 months for each Policy_id.
So first, I want to locate the last element of each Policy_id, then go up from there and compare the fees between months to detect changes.
The expected result:
Id Policy_id Start_Date End_Date Fee1 Fee2 Last_dup Changed
0 b123 2019/02/24 2019/03/23 0 23 0 0
1 b123 2019/03/24 2019/04/23 0 23 0 0
2 b123 2019/04/24 2019/05/23 10 23 1 1
3 c123 2018/09/01 2019/09/30 10 0 0 0
4 c123 2018/10/01 2019/10/31 10 0 1 0
5 d123 2017/02/24 2019/03/23 0 0 0 0
6 d123 2017/03/24 2019/04/23 0 0 1 0
I need to start from the specific row where Last_dup is 1, then go up and compare changes of FeeX. Thanks!
I think adding a "transaction number column" for each policy will make this easier. Then you can just de-dupe the transactions to see if there are "changed" rows.
Look at the following for example:
import pandas as pd
dat = [['b123', 234, 522], ['b123', 234, 522], ['c123', 34, 23],
['c123', 38, 23], ['c123', 34, 23]]
cols = ['Policy_id', 'Fee1', 'Fee2']
df = pd.DataFrame(dat, columns=cols)
df['transaction_id'] = 1
df['transaction_id'] = df.groupby('Policy_id').cumsum()['transaction_id']
df2 = df[cols].drop_duplicates()
final_df = df2.join(df[['transaction_id']])
The output is:
Policy_id Fee1 Fee2 transaction_id
0 b123 234 522 1
2 c123 34 23 1
3 c123 38 23 2
And since b123 only has one transaction after de-duping, you know that nothing changed. Something had to change with c123.
You can get all the changed transactions with final_df[final_df.transaction_id > 1].
As mentioned, you might have to do some other math with the dates, but this should get you most of the way there.
Edit: If you want to only look at the last two months, you can filter the DataFrame prior to running the above.
How to do this:
Make a variable for your filtered date like so:
from datetime import date, timedelta
filtered_date = date.today() - timedelta(days=60)
Then I would use the pyjanitor package and its filter_date method. Just filter on whichever column you want; Start_date seems the most reasonable choice here.
import janitor
final_df.filter_date("Start_date", start=filtered_date)
Once you run import janitor, final_df will magically have the filter_date method available.
You can see more filter_date examples here.
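An alternative sketch that stays in plain pandas (using the question's column names): flag a row as Changed when either fee differs from the previous row of the same Policy_id, and only keep the flag on the last row per policy, which is where Last_dup would be 1:

```python
import pandas as pd

df = pd.DataFrame({
    "Policy_id": ["b123", "b123", "b123", "c123", "c123", "d123", "d123"],
    "Fee1":      [0, 0, 10, 10, 10, 0, 0],
    "Fee2":      [23, 23, 23, 0, 0, 0, 0],
})

fees = ["Fee1", "Fee2"]
# Did any fee change versus the previous month of the same policy?
changed = df.groupby("Policy_id")[fees].diff().ne(0).any(axis=1)
# Restrict the flag to the last row of each policy
is_last = ~df.duplicated("Policy_id", keep="last")
df["Changed"] = (changed & is_last).astype(int)
```

The first row of each group has a NaN diff (which ne(0) treats as a change), but the is_last mask discards it, so only a genuine month-over-month change on the final row survives.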

How to Hot Encode with Pandas without combining rows levels

I have created a really big dataframe in pandas, similar to the following:
0 1
user
0 product4 product0
1 product3 product1
I want to use something, like pd.get_dummies(), in such a way that the final df would be like:
product0 product1 product2 product3 product4
user
0 1 0 0 0 1
1 0 1 0 1 0
instead of getting the following from pd.get_dummies():
0_product3 0_product4 1_product0 1_product1
user
0 0 1 1 0
1 1 0 0 1
In summary, I do not want the rows to be combined into the binary columns.
Thanks a lot!
Use reindex with get_dummies
In [539]: dff = pd.get_dummies(df, prefix='', prefix_sep='')
In [540]: s = dff.columns.str[-1].astype(int)
In [541]: cols = 'product' + pd.RangeIndex(s.min(), s.max()+1).astype(str)
In [542]: dff.reindex(columns=cols, fill_value=0)
Out[542]:
product0 product1 product2 product3 product4
user
0 1 0 0 0 1
1 0 1 0 1 0
df = pd.get_dummies(df, prefix='', prefix_sep='')  # strip the column-name prefix and underscore from the dummy names
df = df.sort_index(axis=1)                         # order the columns by name
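Another common idiom (a sketch of my own, not part of the answer above) sidesteps the prefixes entirely by stacking the frame to a single Series before encoding, then collapsing back per user; any product that never occurs (here product2, assuming names product0..product4) still needs a reindex:

```python
import pandas as pd

df = pd.DataFrame({0: ['product4', 'product3'], 1: ['product0', 'product1']})
df.index.name = 'user'

# Stack to one value per row, encode, then take the per-user maximum
out = pd.get_dummies(df.stack()).groupby(level='user').max()

# Add products that never appear so all five columns are present
cols = [f'product{i}' for i in range(5)]
out = out.reindex(columns=cols, fill_value=0)
```

Because the values are stacked into a single column first, get_dummies never sees the original column labels, so no 0_/1_ prefixes are created.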
