Python Pandas count function on condition and subset

I have a dataframe like this:
F_Class Product Packages
Apple Apple_A 1
Apple Apple_A 2
Apple Apple_A 1
Apple Apple_B 2
Bananas Banana_A n.a.
Bananas Banana_A n.a.
I want to build a count function that counts the items in my dataframe as shown below.
The function should count within the subset ['F_Class','Product'].
If df['Packages'] == 2 the counter should increase by +2, else by +1.
The result should look like this:
F_Class Product Packages Counter
Apple Apple_A 1 1
Apple Apple_A 2 3
Apple Apple_A 1 4
Apple Apple_B 2 2
Bananas Banana_A n.a. 1
Bananas Banana_A n.a. 2

If you need a running sum over the Packages numbers, use DataFrameGroupBy.cumsum after replacing missing values with 1:
import pandas as pd

df['Packages'] = pd.to_numeric(df['Packages'], errors='coerce')
df['Counter'] = (df.assign(Packages=df['Packages'].fillna(1).astype(int))
                   .groupby(['F_Class','Product'])['Packages']
                   .cumsum())
print (df)
F_Class Product Packages Counter
0 Apple Apple_A 1.0 1
1 Apple Apple_A 2.0 3
2 Apple Apple_A 1.0 4
3 Apple Apple_B 2.0 2
4 Bananas Banana_A NaN 1
5 Bananas Banana_A NaN 2
Detail:
print (df.assign(Packages = df['Packages'].fillna(1).astype(int)))
F_Class Product Packages
0 Apple Apple_A 1
1 Apple Apple_A 2
2 Apple Apple_A 1
3 Apple Apple_B 2
4 Bananas Banana_A 1
5 Bananas Banana_A 1

Use df.groupby() together with df.transform() as follows:
df['Counter'] = (df.groupby(['F_Class','Product'])['Packages']
                   .transform(lambda x: x.eq('2').add(1).cumsum()))
print(df)
F_Class Product Packages Counter
0 Apple Apple_A 1 1
1 Apple Apple_A 2 3
2 Apple Apple_A 1 4
3 Apple Apple_B 2 2
4 Bananas Banana_A n.a. 1
5 Bananas Banana_A n.a. 2
If the values in column Packages are integers rather than strings, change '2' to 2:
df['Counter'] = (df.groupby(['F_Class','Product'])['Packages']
                   .transform(lambda x: x.eq(2).add(1).cumsum()))
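The same running counter can also be built without transform, by computing the per-row increment first and grouping it by the key columns passed as Series; a minimal sketch (integer-valued Packages assumed):
# +2 where Packages == 2, otherwise +1, then a per-group running sum
df['Counter'] = (df['Packages'].eq(2).add(1)
                   .groupby([df['F_Class'], df['Product']])
                   .cumsum())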


Pandas: Create column with rolling sum of previous n rows of another column within the same id/group

Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and potato count as 0; every other string counts as 1. I want to generate a new column sum_last3 where each row holds the sum of the previous 3 rows (inclusive) of the fruit column. When a new id appears, the calculation should restart from the beginning.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby('id', sort=False, as_index=False)['fruit']
                   .rolling(min_periods=1, window=3).sum().astype(int).values)
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby(df['id'], sort=False)
                   .rolling(min_periods=1, window=3).sum().astype(int)
                   .droplevel(0))
or use .values as in your code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby(df['id'], sort=False)
                   .rolling(min_periods=1, window=3).sum().astype(int)
                   .values)
Your code is close; you just need to change id to df['id'] in the .groupby() call. Since .groupby() is now being called on a boolean Series rather than on df itself, it cannot resolve the column label 'id' on its own, so the grouping key must be passed as the fully qualified Series df['id'] (see the sketch below).
Also remove as_index=False, since that parameter applies to a DataFrame groupby rather than to a (boolean) Series groupby here.
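A minimal sketch of the distinction, on made-up data:
import pandas as pd

df = pd.DataFrame({'id': [7, 7, 3], 'fruit': ['apple', None, 'olive']})
mask = df['fruit'].notna()  # a boolean Series detached from df

# mask.groupby('id') raises KeyError -- the Series has no column labels.
# Passing the aligned Series itself as the grouping key works:
print(mask.groupby(df['id'], sort=False).sum())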
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
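For reference, the .droplevel(0) in the first variant is needed because a grouped rolling result carries a MultiIndex of (group key, original row label), which would not align with df on assignment; an illustration on the question's data:
res = ((df['fruit'].ne('potato') & df['fruit'].notna())
       .groupby(df['id'], sort=False)
       .rolling(min_periods=1, window=3).sum())
print(res.index.names)  # ['id', None] -> drop level 0 to realign with df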

Handling values with multiple items in dataframe

Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name (with duplicates removed).
Output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
Thank you for reading.
Sample data changed to include duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed first, use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
         .groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
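To match the comma-plus-space separators in the question's desired output, join with ', ' instead:
df1 = (df.drop_duplicates(['Name','Value'])
         .groupby('Name')['Value']
         .agg(', '.join)
         .reset_index())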
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
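An order-preserving alternative that skips the separate drop_duplicates step is to join the unique values per group; a sketch:
# pd.unique keeps the order of first appearance within each group
df1 = (df.groupby('Name')['Value']
         .agg(lambda x: ','.join(pd.unique(x)))
         .reset_index())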

How to normalize columns with one-hot encoding efficiently in pandas dataframes?

An example dataframe is shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to introduce new columns Fruit-Apple, Fruit-Mango, Fruit-Banana into the dataframe, one-hot encoded for the rows in which each fruit is present. So the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
    if (data['Fruits'][i] == 'Apple' or data['FruitsA'][i] == 'Apple' or data['FruitsB'][i] == 'Apple'):
        data['Fruits-Apple'][i] = 1
        data['Fruits-Banana'][i] = 0
        data['Fruits-Mango'][i] = 0
    elif (data['Fruits'][i] == 'Banana' or data['FruitsA'][i] == 'Banana' or data['FruitsB'][i] == 'Banana'):
        data['Fruits-Apple'][i] = 0
        data['Fruits-Banana'][i] = 1
        data['Fruits-Mango'][i] = 0
    elif (data['Fruits'][i] == 'Mango' or data['FruitsA'][i] == 'Mango' or data['FruitsB'][i] == 'Mango'):
        data['Fruits-Apple'][i] = 0
        data['Fruits-Banana'][i] = 0
        data['Fruits-Mango'][i] = 1
But I notice that the time taken to run this code increases dramatically when there are many types of 'fruits'. In my actual data there are only 1074 rows, and the column I'm trying to "normalize" with one-hot encoding has 18 different values. So there are 18 if conditions inside the for loop, and the code hasn't finished running after 15 minutes. That's absurd (it would be great to know why it's taking so long; on another column that contained only 6 different types of values, the code took much less time, about 3 minutes).
So, what's the best (vectorized) way to achieve this output?
Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
EDIT: If the input is multiple columns, use get_dummies and collapse the duplicated column names with max:
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='')
                .max(level=0, axis=1)
                .add_prefix('Fruit-')))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
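Note: in recent pandas versions DataFrame.max no longer accepts a level argument, so the duplicated column names have to be collapsed with a groupby instead; a sketch of an equivalent (assuming a current pandas where get_dummies returns booleans):
dummies = pd.get_dummies(df, prefix='', prefix_sep='')
# collapse the duplicated column names (one per source column) with max,
# then cast the booleans back to 0/1 integers
dummies = dummies.T.groupby(level=0).max().T.astype(int)
df = df.join(dummies.add_prefix('Fruit-'))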
For better performance use MultiLabelBinarizer with the DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
                          columns=mlb.classes_,
                          index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1

How to map multiple records to one unique id

I have 2 data sets with a common unique ID (with duplicates in the 2nd data frame).
I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output like this:
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
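Since df1 already contains every id, a left merge gives the same result here; how='outer' additionally keeps any ids that appear only in df2:
df1 = df1.merge(df2, how='left').set_index('id').fillna('not available')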

Using Pandas Data Frame how to apply count to multi level grouped columns?

I have a data frame with multiple columns and I want to use count after group by such that it is applied to the combination of 2 or more columns. For example, let's say I have two columns:
user_id product_name
1 Apple
1 Banana
1 Apple
2 Carrot
2 Tomato
2 Carrot
2 Tomato
3 Milk
3 Cucumber
...
What I want to achieve is something like this:
user_id product_name Product_Count_per_User
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Milk 1
3 Cucumber 1
I cannot get it. I tried this:
dcf6 = df3.groupby(['user_id','product_name'])['user_id', 'product_name'].count()
but it does not seem to give what I want, and it displays 4 columns instead of 3. How do I do it? Thanks.
You are counting two columns at the same time; you can just use groupby.size:
(df.groupby(['user_id', 'product_name']).size()
   .rename('Product_Count_per_User').reset_index())
Or apply size to a single column:
df.groupby(['user_id','product_name'])['user_id'].size()
Use GroupBy.size:
dcf6 = (df3.groupby(['user_id','product_name']).size()
           .reset_index(name='Product_Count_per_User'))
print (dcf6)
user_id product_name Product_Count_per_User
0 1 Apple 2
1 1 Banana 1
2 2 Carrot 2
3 2 Tomato 2
4 3 Cucumber 1
5 3 Milk 1
See also: What is the difference between size and count in pandas?
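In short, size counts all rows per group while count excludes NaN; a minimal illustration on made-up data:
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'], 'v': [1.0, None, 2.0]})
print(df.groupby('g').size())        # a: 2, b: 1 -- NaN rows included
print(df.groupby('g')['v'].count())  # a: 1, b: 1 -- NaN rows excluded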
Based on your own code, just do this:
(df.groupby(['user_id','product_name'])['user_id']
   .agg({'Product_Count_per_User':'count'})
   .reset_index(level=1))
product_name Product_Count_per_User
user_id
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Cucumber 1
3 Milk 1
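Note: renaming through a dict passed to SeriesGroupBy.agg was deprecated and later removed from pandas, so on current versions an equivalent is:
dcf6 = (df.groupby(['user_id','product_name'])['user_id']
          .count()
          .rename('Product_Count_per_User')
          .reset_index(level=1))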
