Capture all unique information by group - python

I want to create a unique dataset of fruits. I don't know all the types (e.g. colour, store, price) that could be under each fruit. For each type, there could also be duplicate rows. Is there a way to detect all possible duplicates and capture all unique information in a fully generalisable way?
   type    val        detail
0  fruit   apple
1  colour  green      greenish
2  colour  yellow
3  store   walmart    usa
4  price   10
5  NaN
6  fruit   banana
7  colour  yellow
8  fruit   pear
9  fruit   jackfruit
...
Expected Output
   fruit      colour           store      price  detail           ...
0  apple      [green, yellow]  [walmart]  [10]   [greenish, usa]
1  banana     [yellow]         NaN        NaN    NaN
2  pear       NaN              NaN        NaN    NaN
3  jackfruit  NaN              NaN        NaN    NaN
I tried the following, but it does not get close to the expected output, and it does not show the column names either.
df.groupby("type")["val"].agg(size=len, set=lambda x: set(x))
0 fruit {"apple",...}
1 colour ...

First, create a fruit column from the val values where type is 'fruit', replace non-matching values with NaN and forward fill the missing values, then pivot with DataFrame.pivot_table using a custom function that keeps unique values without NaNs, and finally flatten the MultiIndex:
m = df['type'].eq('fruit')
df['fruit'] = df['val'].where(m).ffill()
df1 = (df.pivot_table(index='fruit', columns='type',
                      aggfunc=lambda x: list(dict.fromkeys(x.dropna())))
         .drop('fruit', axis=1, level=1))
df1.columns = df1.columns.map(lambda x: f'{x[0]}_{x[1]}')
print(df1)
detail_colour detail_price detail_store val_colour val_price \
fruit
apple [greenish] [] [usa] [green, yellow] [10]
banana [] NaN NaN [yellow] NaN
jackfruit NaN NaN NaN NaN NaN
pear NaN NaN NaN NaN NaN
val_store
fruit
apple [walmart]
banana NaN
jackfruit NaN
pear NaN
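If you then want something closer to the expected layout (plain type names and one combined detail column), a possible follow-up sketch on the df1 above (the val_/detail_ column names are assumed from the output) is:
# Keep the val_* columns under their plain type names and collapse every
# detail_* column into a single flattened list per fruit.
vals = df1.filter(like='val_').rename(columns=lambda c: c.split('_', 1)[1])
detail = (df1.filter(like='detail_')
             .apply(lambda r: [v for cell in r.dropna() for v in cell], axis=1)
             .rename('detail'))
out = vals.join(detail).reset_index()
print(out)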

Search N consecutive rows with same value in one dataframe

I need to write Python code that searches, for a variable N, for N consecutive rows in a DataFrame column with the same value (and different from NaN), like this.
I can't figure out how to do it with a for loop because I don't know which row I'm looking at in each case. Any idea how I can do it?
Fruit   2 matches  5 matches
Apple   No         No
NaN     No         No
Pear    No         No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        No
Pear    Yes        Yes
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
NaN     No         No
Banana  No         No
Banana  Yes        No
Update: testing the solution by @Corralien
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0
N = 2
df['Matches'] = df.where(counts >= N, other='No')
VSCode returns the message 'Frame skipped from debugging during step-in.' when executing the last line, and an exception is raised in the previous for loop.
Compute consecutive values and set NaN to 0. Once you have calculated the cumulative counter, you just have to check if the counter is greater than or equal to N:
counts = (df.groupby(df['Fruit'].ne(df['Fruit'].shift()).cumsum())  # virtual groups
            .transform('cumcount').add(1)                           # cumulative counter
            .where(df['Fruit'].notna(), other=0))                   # set NaN to 0

N = 2
df['2 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
N = 5
df['5 matches'] = counts.ge(N).replace({True: 'Yes', False: 'No'})
Output:
>>> df
Fruit 2 matches 5 matches
0 Apple No No
1 NaN No No
2 Pear No No
3 Pear Yes No
4 Pear Yes No
5 Pear Yes No
6 Pear Yes Yes
7 NaN No No
8 NaN No No
9 NaN No No
10 NaN No No
11 NaN No No
12 Banana No No
13 Banana Yes No
>>> counts
0 1
1 0
2 1
3 2
4 3
5 4
6 5
7 0
8 0
9 0
10 0
11 0
12 1
13 2
dtype: int64
Update
If I need to replace "Yes" with the fruit name, for example:
N = 2
df['2 matches'] = df['Fruit'].where(counts >= N, other='No')
print(df)
# Output
Fruit 2 matches
0 Apple No
1 NaN No
2 Pear No
3 Pear Pear
4 Pear Pear
5 Pear Pear
6 Pear Pear
7 NaN No
8 NaN No
9 NaN No
10 NaN No
11 NaN No
12 Banana No
13 Banana Banana
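If more thresholds are needed later, the same counts Series can be reused in a short loop (a minimal sketch, assuming the counts and df from above):
# Reuse the cumulative counter for any number of thresholds.
for n in (2, 5):
    df[f'{n} matches'] = counts.ge(n).map({True: 'Yes', False: 'No'})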

Merge 2 dfs, with the row if it is the only row that contains the word

I have 2 pandas data frames:
df1 = pd.DataFrame({'keyword': ['Sox', 'Sox', 'Jays', 'D', 'Jays'],
                    'val': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'name': ['a b c', 'Sox Red', 'Blue Jays White Sox'],
                    'city': [f'city-{i}' for i in [1, 2, 3]],
                    'info': [5, 6, 7]})
>>> df1
keyword val
0 Sox 1
1 Sox 2
2 Jays 3
3 D 4
4 Jays 5
>>> df2
name city info
0 a b c city-1 5
1 Sox Red city-2 6
2 Blue Jays White Sox city-3 7
For each row of df1, the merge should take the exact element of df1['keyword'] and check whether it is present in each of the df2['name'] elements (e.g. using .str.contains). There are then the following options:
if it is present in exactly 1 row of df2['name']: match the current row of df1 with this 1 row of df2.
otherwise (if it is present in more than 1 or in 0 rows of df2['name']): don't match the current row of df1 with anything - the values will be NaN.
The result should look like this:
keyword val name city info
0 Sox 1 NaN NaN NaN
1 Sox 2 NaN NaN NaN
2 Jays 3 Blue Jays White Sox city-3 7.0
3 D 4 NaN NaN NaN
4 Jays 5 Blue Jays White Sox city-3 7.0
Here in the column "keyword":
"Sox" matches multiples lines of df2 (lines 1 and 2), so its merged with NaNs,
"D" matches 0 lines of df2, so it's also merged with NaNs,
"Jays" matches exactly 1 line in df2 (line 2), so it's merged with this line.
How to do this using pandas?
Use a regex and str.extractall to extract the keywords, remove the duplicates with drop_duplicates, and finally merge:
import re
pattern = '|'.join(map(re.escape, df1['keyword']))
# 'Sox|Sox|Jays|D|Jays'
key = (df2['name'].str.extractall(fr'\b({pattern})\b')[0]
           .droplevel('match')
           .drop_duplicates(keep=False)
       )
out = df1.merge(df2.assign(keyword=key),
                on='keyword', how='left')
print(out)
NB: I'm assuming you want to match full words only; if not, remove the word boundaries (\b).
Output:
keyword val name city info
0 Sox 1 NaN NaN NaN
1 Sox 2 NaN NaN NaN
2 Jays 3 Blue Jays White Sox city-3 7.0
3 D 4 NaN NaN NaN
4 Jays 5 Blue Jays White Sox city-3 7.0
One way to do this is to use a combination of .apply() and .str.contains() to find the rows in df2 that match the rows in df1. Then, we can use .merge() to merge the resulting data frames:
def merge_dfs(row):
    keyword = row['keyword']
    df2_match = df2[df2['name'].str.contains(keyword)]
    return df2_match.iloc[0] if len(df2_match) == 1 else pd.Series(dtype='float64')

result = df1.apply(merge_dfs, axis=1).reset_index(drop=True)
result = df1.merge(result, left_index=True, right_index=True, how='left')
This should give the desired result:
>>> result
keyword val city info name
0 Sox 1 NaN NaN NaN
1 Sox 2 NaN NaN NaN
2 Jays 3 city-3 7.0 Blue Jays White Sox
3 D 4 NaN NaN NaN
4 Jays 5 city-3 7.0 Blue Jays White Sox
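One caveat with this approach: str.contains interprets the keyword as a regular expression, so keywords containing regex metacharacters could match unexpectedly. A hedged variant of the helper escapes the keyword first:
import re

def merge_dfs(row):
    keyword = row['keyword']
    # Escape the keyword so it is matched literally (alternatively, pass regex=False).
    df2_match = df2[df2['name'].str.contains(re.escape(keyword))]
    return df2_match.iloc[0] if len(df2_match) == 1 else pd.Series(dtype='float64')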

Flagging NaN values based on a condition and year

I am trying to implement a requirement of flagging values as NaN based on a condition and a particular year; below is my code:
import pandas as pd
import numpy as np
s = {'Fruits': ['Apple', 'Orange', 'Banana', 'Mango'],
     'month': ['201401', '201502', '201603', '201604'],
     'weight': [2, 4, 1, 6],
     'Quant': [251, 178, 298, 300]}
p = pd.DataFrame(data=s)
upper = 250
How would I be able to flag values as NaN for months 201603 and 201604 (03 and 04 are the months) if Quant is greater than upper? Basically, my intention is to check whether the Quant value is greater than the defined upper value, but only for specific dates, i.e. 201603 and 201604.
This is how the output should look:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN
You can use .loc:
p.loc[(p.Quant > upper) & (p.month.str[-2:].isin(['03','04'])), 'Quant'] = np.nan
Output:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN
You could build a boolean condition that checks if "Quant" is greater than "upper" and the month is "03" or "04", and mask "Quant" column:
p['Quant'] = p['Quant'].mask(p['Quant'].gt(upper) & p['month'].str[-2:].isin(['03','04']))
Output:
Fruits month weight Quant
0 Apple 201401 2 251.0
1 Orange 201502 4 178.0
2 Banana 201603 1 NaN
3 Mango 201604 6 NaN
Use boolean indexing; rows excluded by the mask become NaN on assignment thanks to index alignment (note this writes the result to a new column, Quant1):
p['Quant1'] = p[~(((p['month'] == '201603') | (p['month'] == '201604')) & (p['Quant'] > 250))]['Quant']
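A hedged variant of the same masking idea parses the YYYYMM string once instead of slicing it (assuming p and upper from the question):
# Compare on the parsed month number (3 = March, 4 = April) instead of string slices.
months = pd.to_datetime(p['month'], format='%Y%m').dt.month
p['Quant'] = p['Quant'].mask(p['Quant'].gt(upper) & months.isin([3, 4]))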

Filling Null Values based on conditions on other columns

I want to fill the Null values in the first column based on the value of the 2nd column.
(For example)
For "Apples" in col2, the value should be 12 in places of Nan in the col1
For "Vegies", in col2 the value should be 134 in place of Nan in col1
For every description, there is a specific code(number) in the 1st column. I need to map it somehow.
(IGNORE the . (dots)
All I can think of is to make a dictionary of codes and replace null but's that very hardcoded.
Can anyone help?
col1. col2
12. Apple
134. Vegies
23. Oranges
Nan. Apples
Nan. Vegies
324. Sugar
Nan. Apples
Reupdate:
Here, I replicate your DF, and the implementation:
import pandas as pd
import numpy as np
l1 = [12, 134, 23, np.nan, np.nan, 324, np.nan,np.nan,np.nan,np.nan]
l2 = ["Apple","Vegies","Oranges","Apples","Vegies","Sugar","Apples","Melon","Melon","Grapes"]
df = pd.DataFrame(l1, columns=["col1"])
df["col2"] = pd.DataFrame(l2)
df
Out[26]:
col1 col2
0 12.0 Apple
1 134.0 Vegies
2 23.0 Oranges
3 NaN Apples
4 NaN Vegies
5 324.0 Sugar
6 NaN Apples
7 NaN Melon
8 NaN Melon
9 NaN Grapes
Then to Replace the Null values based on your rules:
df.loc[df.col2 == "Vegies", 'col1'] = 134
df.loc[df.col2 == "Apple", 'col1'] = 12
If you want to apply this at a larger scale, consider making a dictionary first, for example:
item_dict = {"Apples":12, "Melon":65, "Vegies":134, "Grapes":78}
Then apply all of these to your dataframe with this custom function:
def item_mapping(df, dictionary, colsource, coltarget):
    dict_keys = list(dictionary.keys())
    dict_values = list(dictionary.values())
    for x in range(len(dict_keys)):
        df.loc[df[colsource] == dict_keys[x], coltarget] = dict_values[x]
    return df
Usage Examples:
item_mapping(df, item_dict, "col2", "col1")
col1 col2
0 12.0 Apple
1 134.0 Vegies
2 23.0 Oranges
3 12.0 Apples
4 134.0 Vegies
5 324.0 Sugar
6 12.0 Apples
7 65.0 Melon
8 65.0 Melon
9 78.0 Grapes
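With such a dictionary in hand, a vectorized alternative to the loop above is Series.map plus fillna (a minimal sketch, assuming the df and item_dict defined above):
# Map col2 to its code via the dictionary and fill only where col1 is missing.
df['col1'] = df['col1'].fillna(df['col2'].map(item_dict))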

How to expand a df by different dict as columns?

I have a df with different dicts as entries in a column, in my case the column "information". I would like to expand the df by all possible dict.keys(), something like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': pd.Series([1, 2, 3, 4, 5]),
                   'name': pd.Series(['banana',
                                      'apple',
                                      'orange',
                                      'strawberry',
                                      'toast']),
                   'information': pd.Series([{'shape': 'curve', 'color': 'yellow'},
                                             {'color': 'red'},
                                             {'shape': 'round'},
                                             {'amount': 500},
                                             np.nan]),
                   'cost': pd.Series([1, 2, 2, 10, 4])})
id name information cost
0 1 banana {'shape': 'curve', 'color': 'yellow'} 1
1 2 apple {'color': 'red'} 2
2 3 orange {'shape': 'round'} 2
3 4 strawberry {'amount': 500} 10
4 5 toast NaN 4
Should look like this:
id name shape color amount cost
0 1 banana curve yellow NaN 1
1 2 apple NaN red NaN 2
2 3 orange round NaN NaN 2
3 4 strawberry NaN NaN 500.0 10
4 5 toast NaN NaN NaN 4
Another approach would be using pandas.DataFrame.from_records:
import pandas as pd
new = pd.DataFrame.from_records(df.pop('information').apply(lambda x: {} if pd.isna(x) else x))
new = pd.concat([df, new], axis=1)
print(new)
Output:
cost id name amount color shape
0 1 1 banana NaN yellow curve
1 2 2 apple NaN red NaN
2 2 3 orange NaN NaN round
3 10 4 strawberry 500.0 NaN NaN
4 4 5 toast NaN NaN NaN
You can use:
d = {k: {} if v != v else v for k, v in df.pop('information').items()}
df1 = pd.DataFrame.from_dict(d, orient='index')
df = pd.concat([df, df1], axis=1)
print(df)
id name cost shape color amount
0 1 banana 1 curve yellow NaN
1 2 apple 2 NaN red NaN
2 3 orange 2 round NaN NaN
3 4 strawberry 10 NaN NaN 500.0
4 5 toast 4 NaN NaN NaN
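A further option, assuming every non-null entry is a flat dict and the frame has a default RangeIndex, is pandas.json_normalize after replacing NaN with empty dicts:
# Replace NaN with {} so every row can be normalized, then join the expanded columns back.
info = df.pop('information').apply(lambda x: x if isinstance(x, dict) else {})
out = df.join(pd.json_normalize(info.tolist()))
print(out)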
