I am very new to pandas. Can anybody tell me how to map the lists in a DataFrame column to unique integers?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate but both give me errors.
For efficiency, use a list comprehension.
For simple counts:
from itertools import count
c = count(1)
df['new'] = [[next(c) for x in l] for l in df['Data']]
For unique identifiers in case of duplicates:
from itertools import count
c = count(1)
d = {}
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l] for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
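As a quick sketch of how the duplicate-handling variant behaves (assuming a small hypothetical df where "phone" appears twice): the setdefault lookup reuses the integer assigned on first sight.

```python
from itertools import count

import pandas as pd

# Hypothetical df with a duplicate item across rows
df = pd.DataFrame({"Data": [["phone", "laptop"], ["happy", "phone"]]})

c = count(1)
d = {}
# First sighting draws a fresh integer from the counter; repeats reuse it
df["new"] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l]
             for l in df["Data"]]
print(df["new"].tolist())  # [[1, 2], [3, 1]]
```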
You could explode, replace, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
    df_expl.assign(data=lambda df: range(1, len(df) + 1))
           .groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should have unique integer values, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=df_expl.data.factorize()[0] + 1)
.groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
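For reference, Series.factorize assigns 0-based integer codes in order of first appearance (hence the + 1 above), reusing a code whenever a value repeats; a minimal standalone sketch:

```python
import pandas as pd

s = pd.Series(["phone", "laptop", "life", "death", "mortal", "happy", "phone"])
codes, uniques = s.factorize()
# Codes follow first-appearance order; the repeated "phone" reuses its code
print((codes + 1).tolist())  # [1, 2, 3, 4, 5, 6, 1]
```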
Here's an example of what the output should look like:
Dataframe: df with required output
class_id item req_output
a 1 [1]
a 2 [1,2]
a 3 [1,2,3]
b 1 [1]
b 2 [1,2]
I've tried:
df.groupby("class_id").apply(lambda x: list(x["item"]))
class_id output
a [1,2,3]
b [1,2]
but this only gives the aggregation over the whole group, whereas I need the cumulative aggregation in every row, per class.
First, make each element into a list of size 1. Here we are exploiting (abusing?) the fact that [1] + [2] == [1, 2]. Then group by and apply Series.cumsum via GroupBy.apply.
df["req_output"] = (
    df["item"]
    .map(lambda x: [x])
    .groupby(df["class_id"])
    .apply(lambda x: x.cumsum())
)
class_id item req_output
0 a 1 [1]
1 a 2 [1, 2]
2 a 3 [1, 2, 3]
3 b 1 [1]
4 b 2 [1, 2]
Or we can make a function to return the desired list and use GroupBy.transform.
def get_slices(s):
    """
    >>> get_slices(pd.Series([1, 2, 3]))
    [[1], [1, 2], [1, 2, 3]]
    """
    lst = s.tolist()
    return [lst[:i] for i in range(1, len(lst) + 1)]
df['req_output'] = df.groupby('class_id')['item'].transform(get_slices)
I have a column in a dataframe containing lists of strings as such:
id colname
1 ['str1', 'str2', 'str3']
2 ['str3', 'str4']
3 ['str3']
4 ['str2', 'str5', 'str6']
..
The strings in the list have some overlap as you can see.
I would like to append the lists in this column with a value, for example 'strX'. The end result should be:
id colname
1 ['str1', 'str2', 'str3', 'strX']
2 ['str3', 'str4', 'strX']
3 ['str3', 'strX']
4 ['str2', 'str5', 'str6', 'strX']
..
What would be the proper way to achieve this? I have tried appending and adding, but these don't get me the desired result.
So if you want to append "strx" to every list, you can do it as @jezrael pointed out, like this:
df = pd.DataFrame({"1": [[1,2,3], [3,4,5,6]]}, index=[1,2])
print(df)
1
1 [1, 2, 3]
2 [3, 4, 5, 6]
df.apply(lambda x: x["1"].append('strx'), axis=1)  # list.append mutates in place; the apply result is discarded
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strx]
But if you want to add a different value based on the index, you can do that too!
Let's take the same df, but with a dict that specifies what to add:
dico = {1: "strx", 2: "strj"}
df.apply(lambda x: x["1"].append(dico[x.name]), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strj]
You can concatenate a list in Series.map or Series.apply:
df['colname'] = df['colname'].map(lambda x: x + ['strX'])
#alternative
#df['colname'] = df['colname'].apply(lambda x: x + ['strX'])
#or in list comprehension
#df['colname'] = [x + ['strX'] for x in df['colname']]
print (df)
id colname
0 1 [str1, str2, str3, strX]
1 2 [str3, str4, strX]
2 3 [str3, strX]
3 4 [str2, str5, str6, strX]
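Since list.append returns None (which is likely why the append attempts in the question seemed not to work), an alternative is to mutate the list objects in place with a plain loop; a minimal sketch, assuming a small stand-in df:

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "colname": [["str1", "str2"], ["str3"]]})

# list.append mutates each list in place; no assignment back is needed
for lst in df["colname"]:
    lst.append("strX")
print(df["colname"].tolist())  # [['str1', 'str2', 'strX'], ['str3', 'strX']]
```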
I am trying to add a column with values from a dictionary. It will be easy to show you the dummy data.
df = pd.DataFrame({'id':[1,2,3,2,5], 'grade':[5,2,2,1,3]})
dictionary = {'1':[5,8,6,3], '2':[1,2], '5':[8,6,2]}
Notice that not every id is in the dictionary, and that the values are lists. I want to find the rows in the df whose id matches a key in the dictionary and add the corresponding list in a new column. So the desired output will look like this:
output = pd.DataFrame({'id':[1,2,3,2,5], 'grade':[5,2,2,1,3], 'new_column':[[5,8,6,3],[1,2],[],[1,2],[8,6,2]]})
Is this what you want?
df = df.set_index('id')
dictionary = {1:[5,8,6,3], 2:[1,2], 5:[8,6,2]}
df['new_column'] = pd.Series(dictionary)
Note: The keys of the dictionary need to be the same type (int) as the index of the data frame.
>>> print(df)
gender new_column
id
1 0 [5, 8, 6, 3]
2 0 [1, 2]
3 1 NaN
4 1 NaN
5 1 [8, 6, 2]
Update:
A better solution if 'id' column contains duplicates (see comments below):
df['new_column'] = df['id'].map(dictionary)
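If missing ids should get [] instead of NaN, as in the desired output, one option is to route the lookup through dict.get; a sketch assuming int keys, as in the snippet above:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 2, 5], 'grade': [5, 2, 2, 1, 3]})
dictionary = {1: [5, 8, 6, 3], 2: [1, 2], 5: [8, 6, 2]}

# dict.get falls back to [] for ids absent from the dictionary;
# map works per element, so the duplicate id 2 is handled too
df['new_column'] = df['id'].map(lambda i: dictionary.get(i, []))
print(df['new_column'].tolist())
# [[5, 8, 6, 3], [1, 2], [], [1, 2], [8, 6, 2]]
```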
import pandas as pd
df = pd.DataFrame({'id':[1,2,3,4,5], 'gender':[0,0,1,1,1]})
dictionary = {'1':[5,8,6,3], '2':[1,2], '5':[8,6,2]}
then just create a list with the values you want and add them to your dataframe
newValues = [ dictionary.get(str(val),[]) for val in df['id'].values]
df['new_column'] = newValues
>>> print(df)
gender new_column
id
1 0 [5, 8, 6, 3]
2 0 [1, 2]
3 1 []
4 1 []
5 1 [8, 6, 2]
You can construct your column using a special dictionary that has [] as its default value.
from collections import defaultdict
default_dictionary = defaultdict(list)
id = [1,2,3,4,5]
dictionary = {'1':[5,8,6,3], '2':[1,2], '5':[8,6,2]}
for n in dictionary:
    default_dictionary[n] = dictionary[n]
new_column = [default_dictionary[str(n)] for n in id]
new_column is now [[5, 8, 6, 3], [1, 2], [], [], [8, 6, 2]], and you can pass it to pd.DataFrame(...) or assign it as a new column.
I have a following pandas df and I would like to extract the element from the list column based on whatever number that is on the num column:
list num
[1,2,3,4,5] 3
[7,2,1,3,4] 4
To obtain:
list num element
[1,2,3,4,5] 3 4
[7,2,1,3,4] 4 4
I have tried:
df['element'] = df['list'].apply(lambda x: x[df['num'].apply(lambda y: y)])
But I got TypeError: list indices must be integers or slices, not Series.
Is there anyway I can do this? Thank you!
Use DataFrame.apply per rows with axis=1:
df['element'] = df.apply(lambda x: x['list'][x['num']], axis=1)
print (df)
list num element
0 [1, 2, 3, 4, 5] 3 4
1 [7, 2, 1, 3, 4] 4 4
Or list comprehension with zip:
df['element'] = [x[y] for x, y in zip(df['list'], df['num'])]
If some values in num might not be valid indices into the corresponding list, you can use a small helper with a try/except:
import numpy as np

def func(a, b):
    try:
        return a[b]
    except Exception:
        return np.nan

df['element'] = df.apply(lambda x: func(x['list'], x['num']), axis=1)
Using NumPy fancy indexing (this assumes all lists have the same length, so they stack into a 2D array):
list_val = np.array(df.list.values.tolist())
num_val = df.num.values
df['element'] = list_val[np.arange(df.shape[0]), num_val]
Out[295]:
list num element
0 [1, 2, 3, 4, 5] 3 4
1 [7, 2, 1, 3, 4] 4 4
I have a dataframe with two columns A and B that contains lists:
import pandas as pd
df = pd.DataFrame({"A" : [[1,5,10], [], [2], [1,2]],
"B" : [[15, 2], [], [6], []]})
I want to construct a third column C that is defined such that it is equal to the smallest possible difference between list-elements in A and B if they are non-empty, and 0 if one or both of them are empty.
For the first row the smallest difference is 1 (taking absolute values); for the second row it is 0 because both lists are empty; for the third row it is 4; and for the fourth row it is 0 again, due to one empty list. So we ultimately end up with:
df["C"] = [1, 0, 4, 0]
This isn't easily vectorisable, since you have object dtype series of lists. You can use a list comprehension with itertools.product:
from itertools import product
zipper = zip(df['A'], df['B'])
df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) for vals in zipper]
# alternative:
# df['C'] = [min((abs(x - y) for x, y in product(*vals)), default=0) \
# for vals in df[['A', 'B']].values]
print(df)
# A B C
# 0 [1, 5, 10] [15, 2] 1
# 1 [] [] 0
# 2 [2] [6] 4
# 3 [1, 2] [] 0
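The default=0 argument is what handles the empty-list rows: min over an empty iterable would otherwise raise ValueError. A tiny standalone sketch:

```python
from itertools import product

# product() over an empty list yields nothing, so min falls back to default
print(min((abs(x - y) for x, y in product([], [15, 2])), default=0))          # 0
# For non-empty lists the default is unused and the true minimum is returned
print(min((abs(x - y) for x, y in product([1, 5, 10], [15, 2])), default=0))  # 1
```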
You can use the following list comprehension, checking the minimum difference over the cartesian product (itertools.product) of both columns (this assumes A and B are the only columns in df):
from itertools import product
[min(abs(i - j) for i, j in product(*a)) if all(a) else 0 for a in df.values]
[1, 0, 4, 0]
df['C'] = df.apply(lambda row: min([abs(x - y) for x in row['A'] for y in row['B']], default=0), axis=1)
I just want to introduce the unnesting helper again:
df['Diff'] = (unnesting(df[['B']], ['B'])
              .join(unnesting(df[['A']], ['A']))
              .eval('C = B - A').C.abs().min(level=0))
df['Diff'] = df['Diff'].fillna(0).astype(int)
df
Out[60]:
A B Diff
0 [1, 5, 10] [15, 2] 1
1 [] [] 0
2 [2] [6] 4
3 [1, 2] [] 0
FYI
def unnesting(df, explode):
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')
I think this works
def diff(a, b):
    if len(a) > 0 and len(b) > 0:
        return min(abs(i - j) for i in a for j in b)
    return 0
df['C'] = df.apply(lambda x: diff(x.A, x.B), axis=1)
df
A B C
0 [1, 5, 10] [15, 2] 1
1 [] [] 0
2 [2] [6] 4
3 [1, 2] [] 0