I have created a really big dataframe in pandas, similar to the following:
0 1
user
0 product4 product0
1 product3 product1
I want to use something like pd.get_dummies() in such a way that the final df would look like:
product0 product1 product2 product3 product4
user
0 1 0 0 0 1
1 0 1 0 1 0
instead of getting the following from pd.get_dummies():
0_product3 0_product4 1_product0 1_product1
user
0 0 1 1 0
1 1 0 0 1
In summary, I do not want separate dummy columns per original column; all values should be mapped into a single shared set of product columns. Thanks a lot!
Use reindex with get_dummies
In [539]: dff = pd.get_dummies(df, prefix='', prefix_sep='')
In [540]: s = dff.columns.str[-1].astype(int)
In [541]: cols = 'product' + pd.RangeIndex(s.min(), s.max()+1).astype(str)
In [542]: dff.reindex(columns=cols, fill_value=0)
Out[542]:
product0 product1 product2 product3 product4
user
0 1 0 0 0 1
1 0 1 0 1 0
df = pd.get_dummies(df, prefix='', prefix_sep='')  # drop the column-name prefix and underscore from the dummy names
df = df.sort_index(axis=1)                         # order the columns by name
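If the same product could appear in more than one of the original columns, stacking first also works; here is a minimal, self-contained sketch (the sample frame and the product0–product4 range are reconstructed from the question):

import pandas as pd

# Sample frame reconstructed from the question
df = pd.DataFrame({0: ['product4', 'product3'],
                   1: ['product0', 'product1']},
                  index=pd.Index([0, 1], name='user'))

# Stack both columns into one Series, dummy-encode it, then collapse back to
# one row per user; max() keeps a 1 if the product appears in either column.
dummies = pd.get_dummies(df.stack()).groupby(level='user').max().astype(int)

# Add the products that never occur (here product2) as all-zero columns
cols = ['product' + str(i) for i in range(5)]
print(dummies.reindex(columns=cols, fill_value=0))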
Related
I have a pandas dataframe like
user_id  music_id  has_rating
A        a         1
B        b         1
and I would like to automatically add new rows for every user_id & music_id pair that the user has not rated, like
user_id  music_id  has_rating
A        a         1
A        b         0
B        a         0
B        b         1
i.e. one row for each user_id and music_id combination that does not exist in my Pandas dataframe yet.
Is there any way to append such rows automatically?
You can use a temporary reshape with pivot_table and fill_value=0 to fill the missing values with 0:
(df.pivot_table(index='user_id', columns='music_id',
                values='has_rating', fill_value=0)
   .stack()
   .reset_index(name='has_rating')
)
Output:
user_id music_id has_rating
0 A a 1
1 A b 0
2 B a 0
3 B b 1
Try using pd.MultiIndex.from_product()
l = ['user_id', 'music_id']
(df.set_index(l)
   .reindex(pd.MultiIndex.from_product([df[l[0]].unique(), df[l[1]].unique()],
                                        names=l),
            fill_value=0)
   .reset_index())
Output:
user_id music_id has_rating
0 A a 1
1 A b 0
2 B a 0
3 B b 1
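For reference, a self-contained version of the pivot_table route from the first answer, with the sample frame built inline (column names taken from the question):

import pandas as pd

# Sample frame from the question
df = pd.DataFrame({'user_id': ['A', 'B'],
                   'music_id': ['a', 'b'],
                   'has_rating': [1, 1]})

# Pivot to a user_id x music_id grid (missing pairs become 0),
# then stack back into long form.
out = (df.pivot_table(index='user_id', columns='music_id',
                      values='has_rating', fill_value=0)
         .stack()
         .reset_index(name='has_rating'))
print(out)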
Having the following dataframe:
import pandas as pd
cars = ["BMV", "Mercedes", "Audi"]
customer = ["Juan", "Pepe", "Luis"]
price = [100, 200, 300]
year = [2022, 2021, 2020]
df_raw = pd.DataFrame(list(zip(cars, customer, price, year)),
                      columns=["cars", "customer", "price", "year"])
I need to do one-hot encoding of the categorical variables cars and customer; for this I use the get_dummies method on these two columns.
numerical = ["price", "year"]
df_final = pd.concat([df_raw[numerical], pd.get_dummies(df_raw.cars),
                      pd.get_dummies(df_raw.customer)], axis=1)
Is there a way to generate these dummies dynamically, for example by putting the columns in a list and looping over them with a for? In this case it may seem simple because I only have 2, but if I had 30 or 60 attributes, would I have to go one by one?
Use pd.get_dummies with the columns parameter:
pd.get_dummies(df_raw, columns=['cars', 'customer'])
price year cars_Audi cars_BMV cars_Mercedes customer_Juan customer_Luis customer_Pepe
0 100 2022 0 1 0 1 0 0
1 200 2021 0 0 1 0 0 1
2 300 2020 1 0 0 0 1 0
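If the categorical columns are not known in advance, the list passed to columns= can be built dynamically; a sketch, assuming every object/category dtype column in df_raw should be encoded:

# Detect the categorical columns instead of listing them by hand
cat_cols = df_raw.select_dtypes(include=['object', 'category']).columns.tolist()
df_final = pd.get_dummies(df_raw, columns=cat_cols)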
One simple way is to concatenate the columns and use str.get_dummies:
cols = ['cars', 'customer']
out = df_raw.join(df_raw[cols].agg('|'.join, axis=1).str.get_dummies())
output:
cars customer price year Audi BMV Juan Luis Mercedes Pepe
0 BMV Juan 100 2022 0 1 1 0 0 0
1 Mercedes Pepe 200 2021 0 0 0 0 1 1
2 Audi Luis 300 2020 1 0 0 1 0 0
Another option is to melt and use crosstab:
df2 = df_raw[cols].reset_index().melt('index')
out = df_raw.join(pd.crosstab(df2['index'], df2['value']))
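If you specifically want the loop-over-a-list pattern from the question, one way to write it is a list comprehension feeding pd.concat; a sketch reusing the numerical list defined in the question:

# One get_dummies call per categorical column, concatenated with the numeric columns
cols = ['cars', 'customer']
dummies = [pd.get_dummies(df_raw[c]) for c in cols]
df_final = pd.concat([df_raw[numerical], *dummies], axis=1)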
I have a MultiIndex dataframe in which 'partner', 'employer', and 'date' form the MultiIndex.
partner  employer  date      ecom  sales
A        a         10/01/21  1     0
A        a         10/02/21  1     0
A        a         10/03/21  0     1
A        b         10/01/21  0     1
A        b         10/02/21  1     0
A        b         10/03/21  1     0
B        c         10/03/21  1     0
B        c         10/04/21  1     0
B        c         10/04/21  0     1
I'm trying to find which unique (partner, employer) pairs have 'ecom' BEFORE 'sales'. For example, I want the output to be the table below. How do I filter each (partner, employer) pair on these conditions in Python?
partner  employer  date      ecom  sales
A        a         10/01/21  1     0
A        a         10/02/21  1     0
A        a         10/03/21  0     1
B        c         10/03/21  1     0
B        c         10/04/21  1     0
B        c         10/04/21  0     1
Try this:
import numpy as np

# Find the first date on which ecom or sales is non-zero
first_date = lambda col: col.first_valid_index()[-1]
tmp = df.replace(0, np.nan).sort_index().groupby(level=[0, 1]).agg(
    first_ecom=('ecom', first_date),
    first_sales=('sales', first_date)
)
# The (partner, employer) pairs where ecom happens before sales
idx = tmp[tmp['first_ecom'] < tmp['first_sales']].index
# Condition to filter the original frame
cond = df.index.droplevel(-1).isin(idx)
# Result
df[cond]
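For reference, a self-contained version of this answer, with the sample frame rebuilt from the question's table; replacing zeros with NaN lets first_valid_index return the first date on which each activity occurs:

import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'partner':  ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B'],
    'employer': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
    'date':     ['10/01/21', '10/02/21', '10/03/21',
                 '10/01/21', '10/02/21', '10/03/21',
                 '10/03/21', '10/04/21', '10/04/21'],
    'ecom':     [1, 1, 0, 0, 1, 1, 1, 1, 0],
    'sales':    [0, 0, 1, 1, 0, 0, 0, 0, 1],
}).set_index(['partner', 'employer', 'date'])

# First date on which each column is non-zero, per (partner, employer) pair
first_date = lambda col: col.first_valid_index()[-1]
tmp = (df.replace(0, np.nan)
         .sort_index()
         .groupby(level=[0, 1])
         .agg(first_ecom=('ecom', first_date),
              first_sales=('sales', first_date)))

# Keep only the pairs where ecom shows up strictly before sales
idx = tmp[tmp['first_ecom'] < tmp['first_sales']].index
print(df[df.index.droplevel(-1).isin(idx)])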
I have DataFrame in Pandas like this:
df = pd.DataFrame({"price_range": [0,1,2,3,0,2], "blue":[0,0,1,0,1,1], "four_g":[0,0,0,1,0,1]})
I have a line like this: pd.crosstab(df['price_range'], df["blue"])
However, this only shows me how many "blue" values of 0 and 1 there are for each "price_range". I want to expand this code to also see how many "four_g" values of 0 and 1 there are for each "price_range". How can I do that? Please help me.
One way is to use 'melt':
df_out = df.melt('price_range')
pd.crosstab(df_out['price_range'], df_out['variable'], df_out['value'], aggfunc='sum')
Output:
variable blue four_g
price_range
0 1 0
1 0 0
2 2 1
3 0 1
Another way is to use groupby:
df.groupby('price_range')[['blue','four_g']].sum()
Output:
blue four_g
price_range
0 1 0
1 0 0
2 2 1
3 0 1
The simplest way is to use two crosstabs, via a list comprehension with concat:
cols = ['blue', 'four_g']
df_out = pd.concat([pd.crosstab(df['price_range'], df[col])
for col in cols], keys=cols, axis=1)
Out[1116]:
blue four_g
blue 0 1 0 1
price_range
0 1 1 2 0
1 1 0 1 0
2 0 2 1 1
3 1 0 0 1
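If you want the per-value counts of the last answer without concatenating separate crosstabs, passing both melted columns to a single crosstab is another option (a sketch reusing the melt from the first answer):

# Counts per (column, value) pair in one crosstab call
df_out = df.melt('price_range')
pd.crosstab(df_out['price_range'], [df_out['variable'], df_out['value']])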
I have a dataframe like the one below. The column Mfr Number is a categorical data type. I'd like to perform get_dummies or one-hot encoding on it, but instead of filling in the new column with a 1 if it's from that row, I want it to fill in the value from the quantity column. All the other new 'dummies' should remain 0 on that row. Is this possible?
Datetime Mfr Number quantity
0 2016-03-15 07:02:00 MWS0460MB 1
1 2016-03-15 07:03:00 TM-120-6X 3
2 2016-03-15 08:33:00 40.50699.0095 5
3 2016-03-15 08:42:00 40.50699.0100 1
4 2016-03-15 08:46:00 CXS-04T098-00-0703R-1025 10
Do it in two steps:
dummies = pd.get_dummies(df['Mfr Number'])
# overwrite each row's single non-zero cell with that row's quantity
dummies.values[dummies != 0] = df['quantity']
Check str.get_dummies combined with mul:
df['Mfr Number'].str.get_dummies().mul(df['quantity'], axis=0)
40.50699.0095 40.50699.0100 ... MWS0460MB TM-120-6X
0 0 0 ... 1 0
1 0 0 ... 0 3
2 5 0 ... 0 0
3 0 1 ... 0 0
4 0 0 ... 0 0
[5 rows x 5 columns]
df = pd.get_dummies(df, columns=['Mfr Number'])
# multiply every dummy column (everything after Datetime and quantity) by quantity
for col in df.columns[2:]:
    df[col] = df[col] * df['quantity']
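Another option, not taken from the answers above but worth sketching: because each original row has exactly one Mfr Number, pivot_table can place the quantity directly into the one-hot layout (assuming df is the original frame from the question with its default RangeIndex; the default aggfunc is fine since every row/value pair is unique):

# Quantity-valued dummies straight from pivot_table, one row per original row
quantity_dummies = df.pivot_table(index=df.index, columns='Mfr Number',
                                  values='quantity', fill_value=0)
out = pd.concat([df[['Datetime', 'quantity']], quantity_dummies], axis=1)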