Here's an example of what the output should look like:
DataFrame df with the required output:
class_id item req_output
a 1 [1]
a 2 [1,2]
a 3 [1,2,3]
b 1 [1]
b 2 [1,2]
I've tried:
df.groupby("class_id").apply(lambda x: list(x["item"]))
class_id output
a [1,2,3]
b [1,2]
but this only gives the final aggregation for each group, whereas I need the cumulative aggregation at every row within each class.
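For reference, the example frame can be built like this (column names taken from the table above):
import pandas as pd

df = pd.DataFrame({"class_id": ["a", "a", "a", "b", "b"],
                   "item": [1, 2, 3, 1, 2]})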
First, make each element into a list of size 1. Here we are exploiting (abusing?) the fact that [1] + [2] == [1, 2]. Then group by class_id and GroupBy.apply Series.cumsum.
df["req_output"] = (
df["item"]
.map(lambda x: [x])
.groupby(df["class_id"])
.apply(lambda x: x.cumsum())
)
class_id item req_output
0 a 1 [1]
1 a 2 [1, 2]
2 a 3 [1, 2, 3]
3 b 1 [1]
4 b 2 [1, 2]
Or we can make a function to return the desired list and use GroupBy.transform.
def get_slices(s):
    """
    >>> get_slices(pd.Series([1, 2, 3]))
    [[1], [1, 2], [1, 2, 3]]
    """
    lst = s.tolist()
    return [lst[:i] for i in range(1, len(lst) + 1)]
df['req_output'] = df.groupby('class_id')['item'].transform(get_slices)
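As a sketch of the same idea, itertools.accumulate can build the running lists inside the transform (assuming the same df as above; accumulate concatenates the one-element lists step by step):
from itertools import accumulate

# each [x] is a one-element list; accumulate adds them, i.e. concatenates
df["req_output"] = df.groupby("class_id")["item"].transform(
    lambda s: list(accumulate([x] for x in s))
)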
Related
I am very new to pandas. Can anybody tell me how to map the lists in a DataFrame to unique integers?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate() but both gave me errors.
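For reference, a minimal construction of the example column (the column name Data is assumed from the output shown below):
import pandas as pd

df = pd.DataFrame({"Data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy"]]})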
For efficiency, use a list comprehension.
For simple counts:
from itertools import count
c = count(1)
df['new'] = [[next(c) for x in l] for l in df['Data']]
For unique identifiers in case of duplicates:
from itertools import count
c = count(1)
d = {}
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l] for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
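To illustrate the difference, here is the duplicate-aware version run on a hypothetical frame where phone appears twice; the repeated item keeps its first identifier:
from itertools import count

df = pd.DataFrame({"Data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})
c = count(1)
d = {}
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l] for l in df['Data']]
print(df)
#                     Data        new
# 0        [phone, laptop]     [1, 2]
# 1  [life, death, mortal]  [3, 4, 5]
# 2         [happy, phone]     [6, 1]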
You could explode, assign, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=lambda df: range(1, len(df) + 1))
.groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should have unique integer values, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=df_expl.data.factorize()[0] + 1)
.groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
Example data:
import numpy as np
import pandas as pd

dictionary = {'col1': [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
              'col2': [[1, np.nan, 3], [1, 2, np.nan], [1, 2, 3]],
              'col3': [[1, 2, 3], [1, 2, 3], [1, np.nan, 3]]}
df = pd.DataFrame(dictionary)
I have a DataFrame of lists and I want to count the number of NaN values by row:
col1 col2 col3
[1,2,3] [1,NaN,3] [1,2,3]
[1,2,3] [1,2,NaN] [1,2,3]
[1,2,3] [1,2,3] [1,NaN,3]
I used
acceptable_data_df.iloc[:, 1:].apply(lambda x: list(itertools.chain(*x)), axis=1)
to convert each row to one flat list, hoping to make it easier, but I'm still stuck. (The first column was text.)
[1,2,3,1,NaN,3,1,2,3]
[1,2,3,1,2,NaN,1,2,3]
[1,2,3,1,2,3,1,NaN,3]
How can I do this?
You could stack + explode + isna to get a Series where it's True for NaN and False otherwise. Then groupby + sum fetches the number of NaN values per row:
df['number of NaN'] = df.stack().explode().isna().groupby(level=0).sum()
Output:
col1 col2 col3 number of NaN
0 [1, 2, 3] [1, nan, 3] [1, 2, 3] 1
1 [1, 2, 3] [1, 2, nan] [1, 2, 3] 1
2 [1, 2, 3] [1, 2, 3] [1, nan, 3] 1
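To see why this works: stack moves the columns into the index, giving one list per (row, column) pair; explode then flattens each list while repeating the row label in level 0 of the index; isna flags the actual NaN values. A sketch of the intermediate on the three list columns:
s = df[['col1', 'col2', 'col3']].stack().explode()
print(s.head(6))
# 0  col1      1
#    col1      2
#    col1      3
#    col2      1
#    col2    NaN
#    col2      3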
IIUC use:
df['count'] = (df.col1 + df.col2 + df.col3).apply(lambda x: pd.isna(x).sum())
A "boring" way to count "NaN" while keeping the values as a string is to just count how many times the "NaN" string appears:
(df["col1"] + df["col2"] + df["col3"]).str.count("NaN")
I have a column in a dataframe containing lists of strings as such:
id colname
1 ['str1', 'str2', 'str3']
2 ['str3', 'str4']
3 ['str3']
4 ['str2', 'str5', 'str6']
..
The strings in the list have some overlap as you can see.
I would like to append the lists in this column with a value, for example 'strX'. The end result should be:
id colname
1 ['str1', 'str2', 'str3', 'strX']
2 ['str3', 'str4', 'strX']
3 ['str3', 'strX']
4 ['str2', 'str5', 'str6', 'strX']
..
What would be the proper way to achieve this? I have tried appending and adding, but these don't get me the desired result.
So if you want to append "strx" to all of them, you can do it as @jezrael pointed out:
df = pd.DataFrame({"1": [[1,2,3], [3,4,5,6]]}, index=[1,2])
print(df)
1
1 [1, 2, 3]
2 [3, 4, 5, 6]
df.apply(lambda x: x["1"].append('strx'), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strx]
But if you want to add a different value based on the index, you can do that too!
Let's take the same df, but with a dict that specifies what to add for each index:
dico = {1: "strx", 2: "strj"}
df.apply(lambda x: x["1"].append(dico[x.name]), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strj]
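Note that list.append returns None: the apply calls above work by mutating the lists in place, and the Series apply returns is all None. A non-mutating sketch of the same per-index logic, assuming a fresh df and the dico above:
# build new lists instead of mutating the existing ones
df["1"] = [lst + [dico[idx]] for idx, lst in df["1"].items()]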
You can add a list in Series.map or Series.apply:
df['colname'] = df['colname'].map(lambda x: x + ['strX'])
#alternative
#df['colname'] = df['colname'].apply(lambda x: x + ['strX'])
#or in list comprehension
#df['colname'] = [x + ['strX'] for x in df['colname']]
print (df)
id colname
0 1 [str1, str2, str3, strX]
1 2 [str3, str4, strX]
2 3 [str3, strX]
3 4 [str2, str5, str6, strX]
I have the following pandas df and I would like to extract the element from the list column based on whatever number is in the num column:
list num
[1,2,3,4,5] 3
[7,2,1,3,4] 4
To obtain:
list num element
[1,2,3,4,5] 3 4
[7,2,1,3,4] 4 4
I have tried:
df['element'] = df['list'].apply(lambda x: x[df['num'].apply(lambda y: y)])
But I got TypeError: list indices must be integers or slices, not Series.
Is there any way I can do this? Thank you!
Use DataFrame.apply per rows with axis=1:
df['element'] = df.apply(lambda x: x['list'][x['num']], axis=1)
print (df)
list num element
0 [1, 2, 3, 4, 5] 3 4
1 [7, 2, 1, 3, 4] 4 4
Or list comprehension with zip:
df['element'] = [x[y] for x, y in zip(df['list'], df['num'])]
If it's possible that some values don't match a valid list index, you can use:
def func(a, b):
    try:
        return a[b]
    except Exception:
        return np.nan
df['element'] = df.apply(lambda x: func(x['list'], x['num']), axis=1)
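For example, with a hypothetical second row whose num is out of range (import numpy as np assumed):
df2 = pd.DataFrame({'list': [[1, 2, 3, 4, 5], [7, 2]], 'num': [3, 4]})
df2['element'] = df2.apply(lambda x: func(x['list'], x['num']), axis=1)
print(df2)
#               list  num  element
# 0  [1, 2, 3, 4, 5]    3      4.0
# 1           [7, 2]    4      NaN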
Using numpy fancy indexing (this builds a 2-D array, so it assumes all the lists have the same length):
list_val = np.array(df.list.values.tolist())
num_val = df.num.values
df['element'] = list_val[np.arange(df.shape[0]), num_val]
Out[295]:
list num element
0 [1, 2, 3, 4, 5] 3 4
1 [7, 2, 1, 3, 4] 4 4
I have a dataframe with two columns A and B that contains lists:
import pandas as pd
df = pd.DataFrame({"A" : [[1,5,10], [], [2], [1,2]],
"B" : [[15, 2], [], [6], []]})
I want to construct a third column C that is defined such that it is equal to the smallest possible difference between list-elements in A and B if they are non-empty, and 0 if one or both of them are empty.
For the first row the smallest difference is 1 (we take the absolute value), for the second row it is 0 because the lists are empty, for the third row it is 4, and for the fourth row it is 0 again due to one empty list, so we ultimately end up with:
df["C"] = [1, 0, 4, 0]
This isn't easily vectorisable, since you have object dtype series of lists. You can use a list comprehension with itertools.product:
from itertools import product
zipper = zip(df['A'], df['B'])
df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) for vals in zipper]
# alternative:
# df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) \
# for vals in df[['A', 'B']].values]
print(df)
# A B C
# 0 [1, 5, 10] [15, 2] 1
# 1 [] [] 0
# 2 [2] [6] 4
# 3 [1, 2] [] 0
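The default=0 argument is what handles the empty rows: product(*vals) yields nothing when either list is empty, and min over an empty iterable would otherwise raise a ValueError:
min((abs(x - y) for x, y in product([], [15, 2])), default=0)
# 0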
You can use the following list comprehension, checking the minimum difference over the Cartesian product (itertools.product) of both columns (note that df.values assumes the frame contains only the A and B columns):
[min(abs(i-j) for i,j in product(*a)) if all(a) else 0 for a in df.values]
[1, 0, 4, 0]
df['C'] = df.apply(lambda row: min([abs(x - y) for x in row['A'] for y in row['B']], default=0), axis=1)
I just want to introduce unnesting again:
df['Diff']=unnesting(df[['B']],['B']).join(unnesting(df[['A']],['A'])).eval('C=B-A').C.abs().min(level=0)
df.Diff=df.Diff.fillna(0).astype(int)
df
Out[60]:
A B Diff
0 [1, 5, 10] [15, 2] 1
1 [] [] 0
2 [2] [6] 4
3 [1, 2] [] 0
FYI
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, 1), how='left')
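To see what unnesting produces on this data (a sketch of the intermediate; the values come back as floats because concatenating with an empty list promotes to float64, hence the astype(int) above):
print(unnesting(df[['A']], ['A']))
#       A
# 0   1.0
# 0   5.0
# 0  10.0
# 2   2.0
# 3   1.0
# 3   2.0
Rows whose list is empty (index 1 here) simply disappear, which is why the fillna(0) step is needed.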
I think this works:
def diff(a, b):
    if len(a) > 0 and len(b) > 0:
        return min([abs(i - j) for i in a for j in b])
    return 0
df['C'] = df.apply(lambda x: diff(x.A, x.B), axis=1)
df
A B C
0 [1, 5, 10] [15, 2] 1
1 [] [] 0
2 [2] [6] 4
3 [1, 2] [] 0