I have a dataframe with two columns A and B that contains lists:
import pandas as pd
df = pd.DataFrame({"A" : [[1,5,10], [], [2], [1,2]],
"B" : [[15, 2], [], [6], []]})
I want to construct a third column C that is defined such that it is equal to the smallest possible difference between list-elements in A and B if they are non-empty, and 0 if one or both of them are empty.
For the first row the smallest difference is 1 (we take absolute value..), for the second row it is 0 due to lists being empty, third row is 4 and fourth row is 0 again due to one empty list, so we ultimately end up with:
df["C"] = [1, 0, 4, 0]
This isn't easily vectorisable, since you have object dtype series of lists. You can use a list comprehension with itertools.product:
from itertools import product
zipper = zip(df['A'], df['B'])
df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) for vals in zipper]
# alternative:
# df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) \
# for vals in df[['A', 'B']].values]
print(df)
# A B C
# 0 [1, 5, 10] [15, 2] 1
# 1 [] [] 0
# 2 [2] [6] 4
# 3 [1, 2] [] 0
You can use the following list comprehension, checking for the min difference of the cartesian product (itertools.product) from both columns
[min(abs(i-j) for i,j in product(*a)) if all(a) else 0 for a in df.values]
[1, 0, 4, 0]
df['C'] = df.apply(lambda row: min([abs(x - y) for x in row['A'] for y in row['B']], default=0), axis=1)
I just want to introduce the unnesting again
df['Diff']=unnesting(df[['B']],['B']).join(unnesting(df[['A']],['A'])).eval('C=B-A').C.abs().min(level=0)
df.Diff=df.Diff.fillna(0).astype(int)
df
Out[60]:
A B Diff
0 [1, 5, 10] [15, 2] 1
1 [] [] 0
2 [2] [6] 4
3 [1, 2] [] 0
FYI
def unnesting(df, explode):
idx=df.index.repeat(df[explode[0]].str.len())
df1=pd.concat([pd.DataFrame({x:np.concatenate(df[x].values)} )for x in explode],axis=1)
df1.index=idx
return df1.join(df.drop(explode,1),how='left')
I think this works
def diff(a,b):
if len(a) > 0 and len(b) > 0:
return min([abs(i-j) for i in a for j in b])
return 0
df['C'] = df.apply(lambda x: diff(x.A, x.B), axis=1)
df
A B C
0 [1, 5, 10] [15, 2] 1
1 [] [] 0
2 [2] [6] 4
3 [1, 2] [] 0
Related
i am very new to pandas can anybody tell me how to map uniquely lists for a dataframe?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate but both give me errors.
For efficiency, use a list comprehension.
For simple counts:
from itertools import count
c = count(1)
df['new'] = [[next(c) for x in l ] for l in df['Data']]
For unique identifiers in case of duplicates:
from itertools import count
c = count(1)
d = {}
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l ] for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
You could explode, replace, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
["life", "death", "mortal"],
["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=lambda df: range(1, len(df) + 1))
.groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should have unique integer values, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=df_expl.data.factorize()[0] + 1)
.groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
I have a DataFrame as below
df = pd.DataFrame({
'x' : range(0,5),
'y' : [[0,2],[3,4],[2,3],[3,4],[7,9]]
})
I would like to test for each row of x, if the value is in the list specified by column y
df[df.x.isin(df.y)]
so I would end up with:
Not sure why isin() does not work in this case
df.x.isin(df.y) checks for each element x, e.g. 0, is equal to some of the values of df.y, e.g. is 0 equal to [0,2], no, and so on.
With this, you can just do a for loop:
df[ [x in y for x,y in zip(df['x'], df['y'])] ]
Let us try explode with index loc
out = df.loc[df.explode('y').query('x==y').index.unique()]
Out[217]:
x y
0 0 [0, 2]
2 2 [2, 3]
3 3 [3, 4]
Just an other solution:
result = (
df
.assign(origin_y = df.y)
.explode('y')
.query('x==y')
.drop(columns=['y'])
.rename({'origin_y': 'y'})
)
x y
0 0 [0, 2]
2 2 [2, 3]
3 3 [3, 4]
Here's an example of how the output should look like:
Dataframe: df with required output
class_id item req_output
a 1 [1]
a 2 [1,2]
a 3 [1,2,3]
b 1 [1]
b 2 [1,2]
I've tried:
df.groupby("class").apply(lambda x: list(x["item"])
class_id output
a [1,2,3]
b [1,2]
but this only gives the whole aggregation, however I need the aggregation to happen in every row considering the class
First, make each element into a list of size 1. Here, we are (exploitingabusing?) the fact [1] + [2] = [1, 2]. Then group by and GroupBy.apply Series.cumsum.
df["req_output"] = (
df["item"]
.map(lambda x: [x])
.groupby(df["class_id"])
.apply(lambda x: x.cumsum())
)
class_id item req_output
0 a 1 [1]
1 a 2 [1, 2]
2 a 3 [1, 2, 3]
3 b 1 [1]
4 b 2 [1, 2]
Or we can make a function to return the desired list and use GroupBy.transform.
def get_slices(s):
"""
>>> get_slices( pd.Series([1, 2, 3]) )
[[1], [1, 2], [1, 2, 3]]
"""
lst = s.tolist()
return [lst[:i] for i in range(1, len(lst)+1)]
df['req_output'] = df.groupby('class_id')['item'].transform(get_slices)
I have a dataframe that looks like this:
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
1 2 [5] [D]
2 3 [4, 12] [A, D]
3 4 [2, 6, 13, 12] [X, Z, T, D]
I would like the extract the rows where the list in the AgeGroups column has at least 2 values less than 7 and at least 1 value greater than 8.
So the result should look like this:
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
3 4 [2, 6, 13, 12] [X, Z, T, D]
I'm not sure how to do it.
First create helper DataFrame and compare by DataFrame.lt and
DataFrame.gt, then Series by Series.ge and chain masks by & for bitwise AND:
import ast
#if not lists
#df['AgeGroups'] = df['AgeGroups'].apply(ast.literal_eval)
df1 = pd.DataFrame(df['AgeGroups'].tolist())
df = df[df1.lt(7).sum(axis=1).ge(2) & df1.gt(8).sum(axis=1).ge(1)]
print (df)
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
3 4 [2, 6, 13, 12] [X, Z, T, D]
Or use list comprehension with compare numpy arrays, counts by sum and compare both counts chained by and, because scalars:
m = [(np.array(x) < 7).sum() >= 2 and (np.array(x) > 8).sum() >=1 for x in df['AgeGroups']]
df = df[m]
print (df)
ID AgeGroups PaperIDs
0 1 [3, 3, 10] [A, B, C]
3 4 [2, 6, 13, 12] [X, Z, T, D]
Simple if else logic I wrote for each row using apply function, you can also use list comprehension for row.
data = {'ID':['1', '2', '3', '4'], 'AgeGroups':[[3,3,10],[2],[4,12],[2,6,13,12]],'PaperIDs':[['A','B','C'],['D'],['A','D'],['X','Z','T','D']]}
df = pd.DataFrame(data)
def extract_age(row):
my_list = row['AgeGroups']
count1 = 0
count2 = 0
if len(my_list)>=3:
for i in my_list:
if i<7:
count1 = count1 +1
elif i>8:
count2 = count2+1
if (count1 >= 2) and (count2 >=1):
print(row['AgeGroups'],row['PaperIDs'])
df.apply(lambda x: extract_age(x), axis =1)
Output
[3, 3, 10] ['A', 'B', 'C']
[2, 6, 13, 12] ['X', 'Z', 'T', 'D']
Say I have the following dataframe:
df =pd.DataFrame({'col1':[5,'',2], 'col2':['','',1], 'col3':[9,'','']})
print(df)
col1 col2 col3
5 9
1
2 2 1
Is there a simple way to turn it into a pd.Series of lists, avoiding empty elements? So:
0 [5,9]
1 [1]
2 [2,2,1]
You can try using df.values
Just take df.values. Convert them into list and remove empty elements using map:
In [2193]: df
Out[2193]:
col1 col2 col3
0 5 9
1 1
2 2 2 1
One-liner:
In [2186]: pd.Series(df.values.tolist()).map(lambda row: [x for x in row if x != ''])
Out[2186]:
0 [5, 9]
1 [1]
2 [2, 2, 1]
dtype: object
You can use this
In[1]: [x[x.apply(lambda k: k != '')].tolist() for i, x in df.iterrows()]
Out[1]: [[5, 9], [], [2, 1]]
Similar to #jezreal's solution. But if you do not expect 0 values, you can use the inherent False-ness of empty strings:
L = [x[x.astype(bool)].tolist() for i, x in df.T.items()]
res = pd.Series(L, index=df.index)
Can be done as follows:
# Break down into list of tuples
records = df.to_records().tolist()
# Convert tuples into lists
series = pd.Series(records).map(list)
# Get rid of empty strings
series.map(lambda row: list(filter(lambda x: x != '', row)))
# ... alternatively
series.map(lambda row: [x for x in row if x != ''])
resulting in
0 [0, 5, 9]
1 [1]
2 [2, 2, 1]
Use list comprehension with remove empty value:
L = [x[x != ''].tolist() for i, x in df.T.items()]
s = pd.Series(L, index=df.index)
Or convert values to lists by to_dict with parameter split:
L = df.to_dict(orient='split')['data']
print (L)
[[5, '', 9], ['', '', ''], [2, 1, '']]
And then remove empty values:
s = pd.Series([[y for y in x if y != ''] for x in L], index=df.index)
print (s)
0 [5, 9]
1 []
2 [2, 1]
dtype: object