I have 2 data sets with a common unique ID(duplicates in 2nd data frame)
I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
i want the out put like
df1
id col1
1 mango
straw
2 melon
3 banana
papaya
4 not available
5 not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
Related
Supposed my dataframe
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items of each name.
(By removing duplicates)
output what I want
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
thank you for reading
Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates per both columns is necessary remove first use DataFrame.drop_duplicates and then aggregate join:
df1 = (df.drop_duplicates(['Name','Value'])
.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If not removed duplicates output is:
df2 = (df.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
I have two dataframes.
df1 is
name colour count
0 apple red 3
1 orange orange 2
3 kiwi green 12
df2 is
name count
0 kiwi 2
1 apple 1
I want to update df1 with count from df2 based on name match.
Expected result:
name colour count
0 apple red 1
1 orange orange 2
3 kiwi green 2
I am using this but the results are incorrect.
df3 = df2.combine_first(df1).reindex(df1.index)
How can I do this correctly?
Create index by name in both DataFrames for matching by them by DataFrame.set_index, then DataFrame.combine_first and last DataFrame.reset_index for column from index:
df = df2.set_index('name').combine_first(df1.set_index('name')).reset_index()
print (df)
name colour count
0 apple red 1.0
1 kiwi green 2.0
2 orange orange 2.0
A column of an example dataframe is shown:
Fruit FruitA FruitB
Apple Banana Mango
Banana Apple Apple
Mango Apple Banana
Banana Mango Banana
Mango Banana Apple
Apple Mango Mango
I want to introduce new columns in the dataframe Fruit-Apple, Fruit-Mango, Fruit-Banana with one-hot encoding in the rows they are respectively present. So, the desired output is:
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
Apple Banana Mango 1 1 1
Banana Apple Apple 1 1 0
Mango Apple Banana 1 1 1
Banana Mango Banana 0 1 1
Mango Banana Apple 1 1 1
Apple Mango Mango 1 0 1
My code to do this is:
for i in range(len(data)):
if (data['Fruits'][i] == 'Apple' or data['FruitsA'][i] == 'Apple' or data['FruitsB'][i] == 'Apple'):
data['Fruits-Apple'][i]=1
data['Fruits-Banana'][i]=0
data['Fruits-Mango'][i]=0
elif (data['Fruits'][i] == 'Banana' or data['FruitsA'][i] == 'Banana' or data['FruitsB'][i] == 'Banana'):
data['Fruits-Apple'][i]=0
data['Fruits-Banana'][i]=1
data['Fruits-Mango'][i]=0
elif (data['Fruits'][i] == 'Mango' or data['FruitsA'][i] == 'Mango' or data['FruitsB'][i] == 'Mango'):
data['Fruits-Apple'][i]=0
data['Fruits-Banana'][i]=0
data['Fruits-Mango'][i]=1
But I notice that the time taken for running this code increases dramatically if there are a lot of types of 'fruits'. In my actual data, there are only 1074 rows, and the column I'm trying to "normalize" with one-hot encoding has 18 different values. So, there are 18 if conditions inside the for loop, and the code hasn't finished running for 15 mins now. That's absurd (It would be great to know why its taking so long - in another column that contained only 6 different types of values, the code took much less time to execute, about 3 mins).
So, what's the best (vectorized) way to achieve this output?
Use join with get_dummies and add_prefix:
df = df.join(pd.get_dummies(df['Fruit']).add_prefix('Fruit-'))
print (df)
Fruit Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple 1 0 0
1 Banana 0 1 0
2 Mango 0 0 1
3 Banana 0 1 0
4 Mango 0 0 1
5 Apple 1 0 0
EDIT: If input are multiple columns use get_dummies with max by columns:
df = (df.join(pd.get_dummies(df, prefix='', prefix_sep='')
.max(level=0, axis=1)
.add_prefix('Fruit-')))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
For better performance use MultiLabelBinarizer with DataFrame converted to lists:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.values.tolist()),
columns=mlb.classes_,
index=df.index).add_prefix('Fruit-'))
print (df)
Fruit FruitA FruitB Fruit-Apple Fruit-Banana Fruit-Mango
0 Apple Banana Mango 1 1 1
1 Banana Apple Apple 1 1 0
2 Mango Apple Banana 1 1 1
3 Banana Mango Banana 0 1 1
4 Mango Banana Apple 1 1 1
5 Apple Mango Mango 1 0 1
I have a dataframe with lots of rows. Sometimes are values are one ofs and not very useful for my purpose.
How can I remove all the rows from where columns 2 and 3's value doesn't appear more than 5 times?
df input
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple potato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 grape tomato banana
1 pear tomato banana
1 lemon tomato banana
output
Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
Global Counts
Use stack + value_counts + replace -
v = df[['Col2', 'Col3']]
df[v.replace(v.stack().value_counts()).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
(Update)
Columnwise Counts
Call apply with pd.Series.value_counts on your columns of interest, and filter in the same manner as before -
v = df[['Col2', 'Col3']]
df[v.replace(v.apply(pd.Series.value_counts)).gt(5).all(1)]
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
Details
Use value_counts to count values in your dataframe -
c = v.apply(pd.Series.value_counts)
c
Col2 Col3
apple 6.0 NaN
grape 1.0 NaN
lemon 1.0 NaN
pear 1.0 NaN
potato NaN 1.0
tomato NaN 8.0
Call replace, to replace values in the DataFrame with their counts -
i = v.replace(c)
i
Col2 Col3
0 6 8
1 6 1
2 6 8
3 6 8
4 6 8
5 6 8
6 1 8
7 1 8
8 1 8
From that point,
m = i.gt(5).all(1)
0 True
1 False
2 True
3 True
4 True
5 True
6 False
7 False
8 False
dtype: bool
Use the mask to index df.
Easy way with transform
counts_col2 = df.groupby("Col2")["Col2"].transform(len)
counts_col3 = df.groupby("Col3")["Col3"].transform(len)
mask = (counts_col2 > 5) & (counts_col3 > 5)
df[mask]
output:
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
v=df.astype(str).sum(1)
df[v.eq(v.value_counts()[v.value_counts()>=5].index.values[0])]
Out[145]:
Col1 Col2 Col3 Col4
0 1 apple tomato banana
2 1 apple tomato banana
3 1 apple tomato banana
4 1 apple tomato banana
5 1 apple tomato banana
To create the example data frame
import pandas as pd
text = '''Col1 Col2 Col3 Col4
1 apple tomato banana
1 apple potato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 apple tomato banana
1 grape tomato banana
1 pear tomato banana
1 lemon tomato banana'''
count = 1
data = []
for line in text.split('\n'):
if count == 1:
headers = line.split()
else:
data.append(line.split())
count += 1
df = pd.DataFrame(data = data,columns=headers)
The value_counts method produces a dict with unique column values as the keys and a count as the value. It is these keys I am assigning to k.
value_counts returns a Pandas series object but it is like a dict
This list comprehension has a filtering 'if' statement that ignores keys if the value associated with it isn't > 5
In this example, it returns a list with only one value, but it other cases it could be more.
Col2_more_than_5 = [k for k in df['Col2'].value_counts().keys()
if df['Col2'].value_counts()[k] > 5]
Col3_more_than_5 = [k for k in df['Col3'].value_counts().keys()
if df['Col3'].value_counts()[k] > 5]
I now have two lists that contain the string/s that occur > 5 times in each column and now I create a selector that returns rows where both statements are true
df[(df['Col2'].isin(Col2_more_than_5)) & (df['Col3'].isin(Col3_more_than_5))]
The 'isin' method works if there are more than 1 value in the list
Fastest way by #ALollz
def agg_size_nosort(df):
counts_col2 = df.groupby("Col2", sort=False)["Col2"].transform('size')
counts_col3 = df.groupby("Col3", sort=False)["Col3"].transform('size')
mask = (counts_col2 > 5) & (counts_col3 > 5)
return df[mask]
One can also use filter twice.
df.groupby("Col2").filter(lambda x: len(x) >= 5) \
.groupby("Col3").filter(lambda x: len(x) >= 5)
The documentation of filter
says
Return a copy of a DataFrame excluding elements from groups that do
not satisfy the boolean criterion specified by func.
I have a data frame with multiple columns and I want to use count after group by such that it is applied to the combination of 2 or more columns. for example, let's say I have two columns:
user_id product_name
1 Apple
1 Banana
1 Apple
2 Carrot
2 Tomato
2 Carrot
2 Tomato
3 Milk
3 Cucumber
...
What I want to achieve is something like this:
user_id product_name Product_Count_per_User
1 Apple 1
1 Banana 2
2 Carrot 2
2 Tomato 2
3 Milk 1
3 Cucumber 1
I cannot get it. I tried this:
dcf6 = df3.groupby(['user_id','product_name'])['user_id', 'product_name'].count()
but it does not seem to get what I want and it is displaying 4 columns instead of 3. How to do to it? Thanks.
You are counting two columns at the same time, you can just use groupby.size:
(df.groupby(['user_id', 'Product_Name']).size()
.rename('Product_Count_per_User').reset_index())
Or count only one column:
df.groupby(['user_id','Product_Name'])['user_id'].size()
Use GroupBy.size:
dcf6 = df3.groupby(['user_id','Product_Name']).size()
.reset_index(name='Product_Count_per_User')
print (dcf6)
user_id Product_Name Product_Count_per_User
0 1 Apple 2
1 1 Banana 1
2 2 Carrot 2
3 2 Tomato 2
4 3 Cucumber 1
5 3 Milk 1
What is the difference between size and count in pandas?
Base on your own code , just do this .
df.groupby(['user_id','product_name'])['user_id'].
agg({'Product_Count_per_User':'count'}).reset_index(level=1)
product_name Product_Count_per_User
user_id
1 Apple 2
1 Banana 1
2 Carrot 2
2 Tomato 2
3 Cucumber 1
3 Milk 1