I have an input dataframe like this:
ID  Visit11  Visit12  Visit13  Visit1Int4  Visit15
1   Orange
2   Orange   Apple
3   Grapes
4   Apple
5   Orange   Apple
6   Apple
7   Banana
8   Banana   Apple    Banana   Apple       Banana
I want to fill the first NA of each row with 'Exit' (so for ID 1, Visit12 should be 'Exit'; for ID 2, Visit13 should be 'Exit'; etc.). The final output should look like:
ID  Visit11  Visit12  Visit13  Visit1Int4  Visit15
1   Orange   Exit
2   Orange   Apple    Exit
3   Grapes   Exit
4   Apple    Exit
5   Orange   Apple    Exit
6   Apple    Exit
7   Banana   Exit
8   Banana   Apple    Banana   Apple       Banana
You could start by replacing empty values with np.nan and taking the cumsum of DataFrame.isna along each row. The running count of missing cells equals 1 exactly at the first NA, so you can use np.where to assign 'Exit' where the cumsum is 1, or keep the value in df otherwise:
import numpy as np
import pandas as pd

# Running count of missing cells along each row; equals 1 only at the first NA
m = df.replace('', np.nan).isna().cumsum(axis=1)
# Write 'Exit' where the running count is 1, keep the original value elsewhere
r = np.where(m == 1, 'Exit', df)
pd.DataFrame(r, columns=df.columns).fillna('')
   ID  Visit11  Visit12  Visit13  Visit1Int4  Visit15
0   1  Orange   Exit
1   2  Orange   Apple    Exit
2   3  Grapes   Exit
3   4  Apple    Exit
4   5  Orange   Apple    Exit
5   6  Apple    Exit
6   7  Banana   Exit
7   8  Banana   Apple    Banana   Apple       Banana
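To see why comparing against 1 touches only the first blank cell per row, it can help to print the intermediate m for the sample data (a quick check, assuming the df above). Each row's running count of missing cells passes through 1 exactly once, and row 7 (ID 8) never reaches 1 because it has no missing values:

print (m)
   ID  Visit11  Visit12  Visit13  Visit1Int4  Visit15
0   0        0        1        2           3        4
1   0        0        0        1           2        3
2   0        0        1        2           3        4
3   0        0        1        2           3        4
4   0        0        0        1           2        3
5   0        0        1        2           3        4
6   0        0        1        2           3        4
7   0        0        0        0           0        0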
Let's say we have an example dataframe like the one below:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['strawberry', 'red', 3], ['apple', 'red', 6], ['apple', 'red', 5],
                            ['banana', 'yellow', 9], ['pineapple', 'yellow', 5], ['pineapple', 'yellow', 7],
                            ['apple', 'green', 2], ['apple', 'green', 6], ['kiwi', 'green', 6]]),
                  columns=['Fruit', 'Color', 'Quantity'])
df
Fruit Color Quantity
0 strawberry red 3
1 apple red 6
2 apple red 5
3 banana yellow 9
4 pineapple yellow 5
5 pineapple yellow 7
6 apple green 2
7 apple green 6
8 kiwi green 6
In this df, I'm checking whether there is any change in the Fruit column row by row.
The shift() method offsets the rows by 1, fillna() fills the resulting NaN values, and ne() does the True/False labelling.
So, as you can check, at index 1 strawberry changes to apple, so it is "True".
At index 2 there is no change, so it is "False".
df['Fruit_Check'] = df.Fruit.shift().fillna(df.Fruit).ne(df.Fruit)
df
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 True
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 True
7 apple green 6 False
8 kiwi green 6 True
My problem is: I also want to check the "Color" column. If there is a change there, the Fruit_Check column must default to False. So df should look like this:
df
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 False
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 False
7 apple green 6 False
8 kiwi green 6 True
Also, I shouldn't use a for loop, because it takes too much time when I run it on my original data.
Use DataFrameGroupBy.shift to shift per group:
df['Fruit_Check'] = df.groupby('Color').Fruit.shift().fillna(df.Fruit).ne(df.Fruit)
print (df)
Fruit Color Quantity Fruit_Check
0 strawberry red 3 False
1 apple red 6 True
2 apple red 5 False
3 banana yellow 9 False
4 pineapple yellow 5 True
5 pineapple yellow 7 False
6 apple green 2 False
7 apple green 6 False
8 kiwi green 6 True
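To see what the grouped shift changes, it can help to look at the intermediate Series for the sample data (a quick check against the df above). Within each Color group the first row gets NaN instead of the last Fruit of the previous group:

print (df.groupby('Color').Fruit.shift())
0           NaN
1    strawberry
2         apple
3           NaN
4        banana
5     pineapple
6           NaN
7         apple
8         apple
Name: Fruit, dtype: object

fillna(df.Fruit) then replaces each NaN with the row's own Fruit, so ne() is guaranteed to return False on the first row of every Color group, and the comparison never crosses a group boundary.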
Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items of each name (removing duplicates).
The output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
Thank you for reading.
Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed, first use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
         .groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
         .agg(','.join)
         .reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
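If you'd rather not pre-deduplicate, the same result as df1 can be had in a single groupby by joining each group's unique values (a sketch, equivalent for this data since unique() preserves order of appearance):

df3 = (df.groupby('Name')['Value']
         .agg(lambda s: ','.join(s.unique()))
         .reset_index())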
Some columns of an example dataframe are shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to introduce new columns Fruit-Apple, Fruit-Banana, and Fruit-Mango into the dataframe, one-hot encoded for the rows in which each fruit is respectively present. So, the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
    if (data['Fruits'][i] == 'Apple' or data['FruitsA'][i] == 'Apple' or data['FruitsB'][i] == 'Apple'):
        data['Fruits-Apple'][i] = 1
        data['Fruits-Banana'][i] = 0
        data['Fruits-Mango'][i] = 0
    elif (data['Fruits'][i] == 'Banana' or data['FruitsA'][i] == 'Banana' or data['FruitsB'][i] == 'Banana'):
        data['Fruits-Apple'][i] = 0
        data['Fruits-Banana'][i] = 1
        data['Fruits-Mango'][i] = 0
    elif (data['Fruits'][i] == 'Mango' or data['FruitsA'][i] == 'Mango' or data['FruitsB'][i] == 'Mango'):
        data['Fruits-Apple'][i] = 0
        data['Fruits-Banana'][i] = 0
        data['Fruits-Mango'][i] = 1
But I notice that the time taken to run this code increases dramatically when there are many types of 'fruits'. In my actual data there are only 1074 rows, and the column I'm trying to "normalize" with one-hot encoding has 18 different values. So there are 18 if conditions inside the for loop, and the code hasn't finished running for 15 minutes now. That's absurd (it would be great to know why it's taking so long: for another column that contained only 6 different values, the code took much less time, about 3 minutes).
So, what's the best (vectorized) way to achieve this output?
Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
EDIT: If the input is multiple columns, use get_dummies with max by columns:
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='')
                .max(level=0, axis=1)
                .add_prefix('Fruit-')))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
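Note: the level argument of DataFrame.max was deprecated and has been removed in recent pandas releases. If the call above fails for you, collapsing the duplicated dummy columns by name should be equivalent (a sketch, not from the original answer):

dummies = pd.get_dummies(df, prefix='', prefix_sep='')
# Duplicate column names mark dummies coming from different source columns;
# grouping the transposed frame by name and taking max acts as a row-wise "any"
df = df.join(dummies.T.groupby(level=0).max().T.add_prefix('Fruit-'))
# Recent pandas returns booleans from get_dummies; append .astype(int) for 0/1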
For better performance use MultiLabelBinarizer with DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
                          columns=mlb.classes_,
                          index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
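For intuition on why this works: df.values.tolist() (on the original three fruit columns) hands MultiLabelBinarizer one list of labels per row, which it treats as that row's label set, and mlb.classes_ supplies the sorted distinct labels used as column names. A quick check on the sample data (illustrative output):

print (df.values.tolist()[:2])
[['Apple', 'Banana', 'Mango'], ['Banana', 'Apple', 'Apple']]
print (mlb.classes_)
['Apple' 'Banana' 'Mango']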
I have 2 data sets with a common ID column (unique in the first data frame, duplicated in the 2nd).
I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output to look like:
df1
id  col1
1   mango
    straw
2   melon
3   banana
    papaya
4   not available
5   not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
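Since df1 already holds every id, a left merge would give the same table here; how='outer' only matters if df2 could contain ids that df1 lacks. A sketch under that assumption:

# Same output for this data: every id in df2 also exists in df1
df1 = df1.merge(df2, on='id', how='left').set_index('id').fillna('not available')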
I have a dataframe with lots of rows. Some values are one-offs and not very useful for my purpose.
How can I remove all the rows where the value in column 2 or column 3 doesn't appear more than 5 times?
df input
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple potato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 grape tomato banana
1 pear tomato banana
1 lemon tomato banana
output
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
Global Counts
Use stack + value_counts + replace -
v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
(Update)
Columnwise Counts
Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before -
v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
Details
Use value_counts to count values in your dataframe -
c = v.apply(pd.Series.value_counts)
c
Col2 Col3
apple 6.0 NaN
grape 1.0 NaN
lemon 1.0 NaN
pear 1.0 NaN
potato NaN 1.0
tomato NaN 8.0
Call replace to replace values in the DataFrame with their counts -
i = v.replace(c)
i
Col2 Col3
0 6 8
1 6 1
2 6 8
3 6 8
4 6 8
5 6 8
6 1 8
7 1 8
8 1 8
From that point,
m = i.gt(5).all(1)
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool
Use the mask to index df.
Easy way with transform
counts_col2 = df.groupby("Col2")["Col2"].transform(len)
counts_col3 = df.groupby("Col3")["Col3"].transform(len)
mask = (counts_col2 > 5) & (counts_col3 > 5)
df[mask]
output:
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
# Concatenate each row into a single string signature
v = df.astype(str).sum(1)
# Keep rows whose signature occurs at least 5 times (assumes a single such signature)
df[v.eq(v.value_counts()[v.value_counts() >= 5].index.values[0])]
Out[145]:
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
To create the example data frame:
import pandas as pd
text = '''Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple potato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 grape tomato banana
1 pear tomato banana
1 lemon tomato banana'''
count = 1
data = []
for line in text.split('\n'):
    if count == 1:
        headers = line.split()
    else:
        data.append(line.split())
    count += 1
df = pd.DataFrame(data=data, columns=headers)
value_counts returns a Pandas Series object, but it behaves like a dict, with the unique column values as the keys and their counts as the values. It is these keys I am assigning to k.
This list comprehension has a filtering 'if' statement that ignores a key if the count associated with it isn't > 5.
In this example it returns a list with only one value, but in other cases it could be more.
Col2_more_than_5 = [k for k in df['Col2'].value_counts().keys()
if df['Col2'].value_counts()[k] > 5]
Col3_more_than_5 = [k for k in df['Col3'].value_counts().keys()
if df['Col3'].value_counts()[k] > 5]
I now have two lists that contain the strings occurring more than 5 times in each column, and now I create a selector that returns rows where both conditions are true:
df[(df['Col2'].isin(Col2_more_than_5)) & (df['Col3'].isin(Col3_more_than_5))]
The isin method works even when there is more than one value in the list.
Fastest way, by @ALollz:
def agg_size_nosort(df):
    counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
    counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
    mask = (counts_col2 > 5) & (counts_col3 > 5)
    return df[mask]
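The speed comes from transform('size'), which broadcasts each group's size back to the original rows in one vectorized step instead of calling len on every group in Python, while sort=False skips sorting the group keys. For the sample data the intermediate counts look like this (a quick check):

print (df.groupby("Col2", sort=False)["Col2"].transform('size').tolist())
[6, 6, 6, 6, 6, 6, 1, 1, 1]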
One can also use filter twice.
df.groupby("Col2").filter(lambda x: len(x) >= 5) \
.groupby("Col3").filter(lambda x: len(x) >= 5)
The documentation of filter says:
Return a copy of a DataFrame excluding elements from groups that do not satisfy the boolean criterion specified by func.
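To make the quote concrete, the first filter alone keeps only the apple group, since grape, pear, and lemon each appear once in Col2 (a quick check on the sample data, output illustrative):

print (df.groupby("Col2").filter(lambda x: len(x) >= 5))
  Col1   Col2    Col3    Col4
0    1  apple  tomato  banana
1    1  apple  potato  banana
2    1  apple  tomato  banana
3    1  apple  tomato  banana
4    1  apple  tomato  banana
5    1  apple  tomato  banana

Note that the second filter then counts within this already-filtered frame, where tomato appears only 5 times; that is presumably why this answer uses >= 5 where the earlier approaches use > 5, and the chained call ends up dropping just the potato row.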