Here's my situation -
In[9]: df
Out[9]:
    fruit  val1  val2
0  Orange     1     1
1  orANGE     2     2
2   apple     3     3
3   APPLE     4     4
4   mango     5     5
5   appLE     6     6
In[10]: type(df)
Out[10]: pandas.core.frame.DataFrame
How do I remove case-insensitive duplicates so that the resulting fruit values are all lowercase, with val1 as the sum of the val1 values and val2 as the sum of the val2 values?
Expected result:
    fruit  val1  val2
0  orange     3     3
1   apple    13    13
2   mango     5     5
In two steps:
df['fruit'] = df['fruit'].map(lambda x: x.lower())
res = df.groupby('fruit').sum()
res
# val1 val2
# fruit
# apple 13 13
# mango 5 5
# orange 3 3
And to recover your structure:
res.reset_index()
As per the comment, the lowercasing can be accomplished in a more straightforward way:
df['fruit'] = df['fruit'].str.lower()
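Both steps can also be combined into a single expression without mutating df (a minimal sketch of the same approach):
res = (df.assign(fruit=df['fruit'].str.lower())
         .groupby('fruit', as_index=False)[['val1', 'val2']]
         .sum())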
I need to write Python code that, given a variable N, finds N consecutive rows in a DataFrame column with the same value (and not NaN), like this.
I can't figure out how to do it with a for loop, because I don't know which row I'm looking at in each case. Any idea how I can do it?
Fruit    2 matches  5 matches
Apple    No         No
NaN      No         No
Pear     No         No
Pear     Yes        No
Pear     Yes        No
Pear     Yes        No
Pear     Yes        Yes
NaN      No         No
NaN      No         No
NaN      No         No
NaN      No         No
NaN      No         No
Banana   No         No
Banana   Yes        No
Update: testing the solution by @Corralien:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode shows me the message 'Frame skipped from debugging during step-in.' when executing the last line, and an exception is raised in the previous for loop.
Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum()) # virtual groups
.transform('cumcount').add(1) # cumulative counter
.where(df['Fruit'].notna(), other=0)) # set NaN to 0
N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
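For reference, here is a minimal self-contained sketch of the "virtual groups" trick used above (the sample column mirrors the data in the question, and .cumcount() is called directly rather than through .transform('cumcount')):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Fruit': ['Apple', np.nan, 'Pear', 'Pear', 'Pear', 'Pear',
                             'Pear', np.nan, np.nan, np.nan, np.nan, np.nan,
                             'Banana', 'Banana']})

# Each change of value starts a new group number, so the cumulative count
# restarts at every run of identical consecutive values. Consecutive NaNs
# each open their own group (NaN != NaN), but they are masked to 0 anyway.
groups = df['Fruit'].ne(df['Fruit'].shift()).cumsum()
counts = df.groupby(groups).cumcount().add(1).where(df['Fruit'].notna(), other=0)
print(counts.tolist())  # [1, 0, 1, 2, 3, 4, 5, 0, 0, 0, 0, 0, 1, 2]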
Update
If I need to replace "Yes" with the fruit name, for example:
N = 2
df['2 matches'] = df.where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana
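The same replacement can also be written with numpy.where (a sketch, assuming counts has been computed as above and df still holds the Fruit column):
import numpy as np

N = 2
df['2 matches'] = np.where(counts.ge(N), df['Fruit'], 'No')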
Sample dataset:
id fruit
0 7 NaN
1 7 apple
2 7 NaN
3 7 mango
4 7 apple
5 7 potato
6 3 berry
7 3 olive
8 3 olive
9 3 grape
10 3 NaN
11 3 mango
12 3 potato
In the fruit column, NaN and potato count as 0; every other string counts as 1. I want to generate a new column sum_last3 where each row holds the sum over the previous 3 rows (inclusive) of the fruit column. When a new id appears, the calculation should restart.
Output I want:
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
My Code:
df['sum_last5'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
                   .groupby('id', sort=False, as_index=False)['fruit']
                   .rolling(min_periods=1, window=3).sum().astype(int).values)
You can modify your code slightly, as follows:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.droplevel(0)
)
Or use .values as in your code:
df['sum_last3'] = ((df['fruit'].ne('potato') & df['fruit'].notna())
.groupby(df['id'],sort=False)
.rolling(min_periods=1, window=3).sum().astype(int)
.values
)
Your code is close; you just need to change 'id' to df['id'] in the .groupby() call: since the object .groupby() is called on is now a boolean Series rather than df itself, .groupby() cannot resolve the column from the label 'id' alone and needs the explicit reference to the column. (Note the ['fruit'] selection is also dropped, since a Series groupby has no columns to select.)
Also remove as_index=False, since that parameter applies to a DataFrame rather than the (boolean) Series here.
Result:
print(df)
id fruit sum_last3
0 7 NaN 0
1 7 apple 1
2 7 NaN 1
3 7 mango 2
4 7 apple 2
5 7 potato 2
6 3 berry 1
7 3 olive 2
8 3 olive 3
9 3 grape 3
10 3 NaN 2
11 3 mango 2
12 3 potato 1
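As a side note, the .droplevel(0) in the first variant is needed because groupby(...).rolling(...) returns a result indexed by (group key, original index). A small illustration under the same assumptions as the sample data:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [7, 7, 3], 'fruit': ['apple', np.nan, 'olive']})
mask = df['fruit'].ne('potato') & df['fruit'].notna()
rolled = mask.groupby(df['id'], sort=False).rolling(min_periods=1, window=3).sum()
print(rolled.index)
# MultiIndex([(7, 0), (7, 1), (3, 2)], names=['id', None])
# Dropping level 0 restores alignment with df's original index,
# so the column assignment works.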
Suppose my dataframe is:
Name Value
0 A apple
1 A banana
2 A orange
3 B grape
4 B apple
5 C apple
6 D apple
7 D orange
8 E banana
I want to show the items for each name, removing duplicates.
Output I want:
Name Values
0 A apple, banana, orange
1 B grape, apple
2 C apple
3 D apple, orange
4 E banana
Thank you for reading.
Changed sample data with duplicates:
print (df)
Name Value
0 A apple
1 A apple
2 A banana
3 A banana
4 A orange
5 B grape
6 B apple
7 C apple
8 D apple
9 D orange
10 E banana
If duplicates across both columns need to be removed first, use DataFrame.drop_duplicates and then aggregate with join:
df1 = (df.drop_duplicates(['Name','Value'])
.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df1)
Name Value
0 A apple,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
If duplicates are not removed, the output is:
df2 = (df.groupby('Name')['Value']
.agg(','.join)
.reset_index())
print (df2)
Name Value
0 A apple,apple,banana,banana,orange
1 B grape,apple
2 C apple
3 D apple,orange
4 E banana
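To match the comma-and-space separator in the desired output exactly, the duplicate removal and the join can also be done in one aggregation (a sketch using Series.unique, which preserves first-seen order):
df1 = (df.groupby('Name')['Value']
         .agg(lambda s: ', '.join(s.unique()))
         .reset_index())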
I have a test df like this:
df = pd.DataFrame({'A': ['Apple','Apple', 'Apple','Orange','Orange','Orange','Pears','Pears'],
'B': [1,2,9,6,4,3,2,1]
})
A B
0 Apple 1
1 Apple 2
2 Apple 9
3 Orange 6
4 Orange 4
5 Orange 3
6 Pears 2
7 Pears 1
Now I need to add a new column with the respective % differences in column 'B'. How is this possible? I cannot get this to work.
I have looked at
update column value of pandas groupby().last()
Not sure that it is pertinent to my problem.
And this which looks promising
Pandas Groupby and Sum Only One Column
I need to compute, and insert into the column maxpercchng (for all rows in each group), the maximum percentage change in column 'B' per group of column 'A'.
So I have come up with this code:
grouppercchng = ((df.groupby['A'].max() - df.groupby['A'].min())/df.groupby['A'].iloc[0])*100
and try to add it to the group col 'maxpercchng' like so
group['maxpercchng'] = grouppercchng
Or like so
df_kpi_hot.groupby(['A'], as_index=False)['maxpercchng'] = grouppercchng
Does anyone know how to add the maxpercchng column to all rows in each group?
I believe you need transform, which returns a Series the same size as the original DataFrame, filled with the aggregated values:
g = df.groupby('A')['B']
df['maxpercchng'] = (g.transform('max') - g.transform('min')) / g.transform('first') * 100
print (df)
A B maxpercchng
0 Apple 1 800.0
1 Apple 2 800.0
2 Apple 9 800.0
3 Orange 6 50.0
4 Orange 4 50.0
5 Orange 3 50.0
6 Pears 2 50.0
7 Pears 1 50.0
Or:
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).reset_index()
print (df1)
A B
0 Apple 800.0
1 Orange 50.0
2 Pears 50.0
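If the aggregated column should carry the maxpercchng name in this second variant, one option is Series.rename (a sketch):
g = df.groupby('A')['B']
df1 = ((g.max() - g.min()) / g.first() * 100).rename('maxpercchng').reset_index()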
I have 2 data sets with a common unique ID (duplicated in the 2nd data frame). I want to map all records with respect to each ID.
df1
id
1
2
3
4
5
df2
id col1
1 mango
2 melon
1 straw
3 banana
3 papaya
I want the output like:
df1
id  col1
1   mango
    straw
2   melon
3   banana
    papaya
4   not available
5   not available
Thanks in advance
You're looking to do an outer df.merge:
df1 = df1.merge(df2, how='outer').set_index('id').fillna('not available')
>>> df1
col1
id
1 mango
1 straw
2 melon
3 banana
3 papaya
4 not available
5 not available
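If you also want the repeated ids blanked out, as in the desired display, that is purely cosmetic (a sketch using Series.mask on the duplicated ids, starting from the original df1 and df2):
out = df1.merge(df2, how='outer').fillna('not available')
# Blank out repeated ids (duplicates after the first occurrence), display only.
out['id'] = out['id'].mask(out['id'].duplicated(), '')
print(out.to_string(index=False))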