Appending to lists in a column of a dataframe - Python

I have a column in a dataframe containing lists of strings as such:
id colname
1 ['str1', 'str2', 'str3']
2 ['str3', 'str4']
3 ['str3']
4 ['str2', 'str5', 'str6']
..
The strings in the lists have some overlap, as you can see.
I would like to append the lists in this column with a value, for example 'strX'. The end result should be:
id colname
1 ['str1', 'str2', 'str3', 'strX']
2 ['str3', 'str4', 'strX']
3 ['str3', 'strX']
4 ['str2', 'str5', 'str6', 'strX']
..
What would be the proper way to achieve this? I have tried appending and adding, but these don't get me the desired result.

So if you want to append "strx" to all rows, you can do it as @jezrael points out, like this:
df = pd.DataFrame({"1": [[1,2,3], [3,4,5,6]]}, index=[1,2])
print(df)
              1
1     [1, 2, 3]
2  [3, 4, 5, 6]
df.apply(lambda x: x["1"].append('strx'), axis=1)
print(df)
                    1
1     [1, 2, 3, strx]
2  [3, 4, 5, 6, strx]
But if you want to add a different value based on the index, you can do that too!
Let's take the same df, but with a dict that specifies what to add:
dico = {1: "strx", 2: "strj"}
df.apply(lambda x: x["1"].append(dico[x.name]), axis=1)
print(df)
                    1
1     [1, 2, 3, strx]
2  [3, 4, 5, 6, strj]
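Note that list.append returns None and mutates the list in place, which is why the apply result above is discarded and df is printed afterwards. A non-mutating sketch of the same idea, using the same toy frame (the dico lookup is carried over from above):
import pandas as pd

df = pd.DataFrame({"1": [[1, 2, 3], [3, 4, 5, 6]]}, index=[1, 2])
dico = {1: "strx", 2: "strj"}

# Build a new list per row instead of mutating the original;
# x.name is the row's index label, used to look up what to add.
df["1"] = df.apply(lambda x: x["1"] + [dico[x.name]], axis=1)
print(df)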

You can add a list in Series.map or Series.apply:
df['colname'] = df['colname'].map(lambda x: x + ['strX'])
#alternative
#df['colname'] = df['colname'].apply(lambda x: x + ['strX'])
#or in list comprehension
#df['colname'] = [x + ['strX'] for x in df['colname']]
print (df)
   id                    colname
0   1  [str1, str2, str3, strX]
1   2         [str3, str4, strX]
2   3               [str3, strX]
3   4  [str2, str5, str6, strX]
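For completeness, a minimal self-contained sketch reproducing the question's frame (column names taken from the question):
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "colname": [["str1", "str2", "str3"],
                ["str3", "str4"],
                ["str3"],
                ["str2", "str5", "str6"]],
})

# x + ['strX'] builds a new list per row, leaving the original lists untouched.
df["colname"] = df["colname"].map(lambda x: x + ["strX"])
print(df)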

Related

How to uniquely map lists in a dataframe?

I am very new to pandas. Can anybody tell me how to uniquely map lists in a dataframe?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate but both give me errors.
For efficiency, use a list comprehension.
For simple counts:
from itertools import count
c = count(1)
df['new'] = [[next(c) for x in l] for l in df['Data']]
For identifiers that stay stable in case of duplicates (each distinct item keeps one id):
from itertools import count
c = count(1)
d = {}
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l] for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
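The d[x] if x in d else d.setdefault(x, next(c)) expression is a small memo: an item already seen reuses its stored number, and only a brand-new item draws from the counter. The conditional matters, because a bare d.setdefault(x, next(c)) would evaluate next(c) even for items already in d and leave gaps in the numbering. A self-contained sketch, assuming a duplicate ('phone') across rows:
import pandas as pd
from itertools import count

df = pd.DataFrame({"Data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})

c = count(1)
d = {}
# Reuse the stored id for seen items; only new items consume the counter.
df["new"] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l]
             for l in df["Data"]]
print(df)
# rows map to [1, 2], [3, 4, 5], [6, 1] -- the repeated "phone" reuses 1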
You could explode, replace, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
["life", "death", "mortal"],
["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=lambda df: range(1, len(df) + 1))
.groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should reuse the same integer value, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
    df_expl.assign(data=df_expl.data.factorize()[0] + 1)
           .groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
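If you want the factorize behaviour without the explode round-trip, here is a sketch that factorizes the flattened values once and cuts the id array back into per-row pieces (itertools.chain does the flattening):
import pandas as pd
from itertools import chain

df = pd.DataFrame({"data": [["phone", "laptop"],
                            ["life", "death", "mortal"],
                            ["happy", "phone"]]})

# Factorize all values in one pass, then slice the flat id array
# back into chunks matching each row's list length.
flat = list(chain.from_iterable(df["data"]))
codes = pd.factorize(flat)[0] + 1
lengths = df["data"].str.len()
ends = lengths.cumsum()
df["data_mapped"] = [list(codes[e - n:e]) for n, e in zip(lengths, ends)]
print(df)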

Counting NaN values by row, in a List

Example data:
dictionary = {'col1': [[1,2,3], [1,2,3], [1,2,3]],
              'col2': [[1,'nan',3], [1,2,'nan'], [1,2,3]],
              'col3': [[1,2,3], [1,2,3], [1,'nan',3]]}
df = pd.DataFrame(dictionary)
I have a DataFrame of lists and I want to count the number of NaN values by row:
col1 col2 col3
[1,2,3] [1,NaN,3] [1,2,3]
[1,2,3] [1,2,NaN] [1,2,3]
[1,2,3] [1,2,3] [1,NaN,3]
I used
acceptable_data_df.iloc[:, 1:].apply(lambda x: list(itertools.chain(*x)), axis=1)
to convert them to one list and hopefully make it easier, but I'm still stuck. (The first column was text.)
[1,2,3,1,NaN,3,1,2,3]
[1,2,3,1,2,NaN,1,2,3]
[1,2,3,1,2,3,1,NaN,3]
How can I do this?
You could stack + explode + isna to get a Series where it's True for NaN and False otherwise. Then groupby + sum fetches the number of NaN values per row:
df['number of NaN'] = df.stack().explode().isna().groupby(level=0).sum()
Output:
col1 col2 col3 number of NaN
0 [1, 2, 3] [1, nan, 3] [1, 2, 3] 1
1 [1, 2, 3] [1, 2, nan] [1, 2, 3] 1
2 [1, 2, 3] [1, 2, 3] [1, nan, 3] 1
IIUC use:
df['count'] = (df.col1 + df.col2 + df.col3).apply(lambda x: pd.isna(x).sum())
A "boring" way to count "NaN" while keeping the values as a string is to just count how many times the "NaN" string appears:
(df["col1"] + df["col2"] + df["col3"]).str.count("NaN")

Extracting element from a list in a pandas column based on the value in another column

I have the following pandas df, and I would like to extract the element from the list column based on whatever number is in the num column:
list num
[1,2,3,4,5] 3
[7,2,1,3,4] 4
To obtain:
list num element
[1,2,3,4,5] 3 4
[7,2,1,3,4] 4 4
I have tried:
df['element'] = df['list'].apply(lambda x: x[df['num'].apply(lambda y: y)])
But I got TypeError: list indices must be integers or slices, not Series.
Is there anyway I can do this? Thank you!
Use DataFrame.apply per row with axis=1:
df['element'] = df.apply(lambda x: x['list'][x['num']], axis=1)
print (df)
list num element
0 [1, 2, 3, 4, 5] 3 4
1 [7, 2, 1, 3, 4] 4 4
Or list comprehension with zip:
df['element'] = [x[y] for x, y in zip(df['list'], df['num'])]
If it is possible that some values do not match a valid list index, here is a possible approach:
import numpy as np

def func(a, b):
    try:
        return a[b]
    except Exception:
        return np.nan
df['element'] = df.apply(lambda x: func(x['list'], x['num']), axis=1)
Using numpy fancy indexing:
list_val = np.array(df.list.values.tolist())
num_val = df.num.values
df['element'] = list_val[np.arange(df.shape[0]), num_val]
Out[295]:
list num element
0 [1, 2, 3, 4, 5] 3 4
1 [7, 2, 1, 3, 4] 4 4
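One caveat on the fancy-indexing answer: np.array(df.list.values.tolist()) only yields a proper 2-D array when every list has the same length; with ragged lists you get an object array (or an error on newer NumPy) and the row/column indexing fails. A sketch restating the idea with that assumption spelled out:
import numpy as np
import pandas as pd

df = pd.DataFrame({'list': [[1, 2, 3, 4, 5], [7, 2, 1, 3, 4]],
                   'num': [3, 4]})

# Requires equal-length lists so the conversion yields a 2-D array;
# row i is then picked at position num[i].
list_val = np.array(df['list'].tolist())
df['element'] = list_val[np.arange(len(df)), df['num'].to_numpy()]
print(df)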

Dataframe to Series of lists

Say I have the following dataframe:
df = pd.DataFrame({'col1':[5,'',2], 'col2':['','',2], 'col3':[9,1,1]})
print(df)
  col1 col2 col3
0    5         9
1              1
2    2    2    1
Is there a simple way to turn it into a pd.Series of lists, avoiding empty elements? So:
0 [5,9]
1 [1]
2 [2,2,1]
You can try using df.values. Just take df.values, convert it into a list, and remove empty elements using map:
In [2193]: df
Out[2193]:
  col1 col2 col3
0    5         9
1              1
2    2    2    1
One-liner:
In [2186]: pd.Series(df.values.tolist()).map(lambda row: [x for x in row if x != ''])
Out[2186]:
0 [5, 9]
1 [1]
2 [2, 2, 1]
dtype: object
You can use this:
In[1]: [x[x.apply(lambda k: k != '')].tolist() for i, x in df.iterrows()]
Out[1]: [[5, 9], [1], [2, 2, 1]]
Similar to @jezrael's solution. But if you do not expect 0 values, you can use the inherent False-ness of empty strings:
L = [x[x.astype(bool)].tolist() for i, x in df.T.items()]
res = pd.Series(L, index=df.index)
This can be done as follows:
# Break down into a list of tuples (index=False keeps the index out of the rows)
records = df.to_records(index=False).tolist()
# Convert tuples into lists
series = pd.Series(records).map(list)
# Get rid of empty strings
series.map(lambda row: list(filter(lambda x: x != '', row)))
# ... alternatively
series.map(lambda row: [x for x in row if x != ''])
resulting in
0       [5, 9]
1          [1]
2    [2, 2, 1]
Use a list comprehension that removes empty values:
L = [x[x != ''].tolist() for i, x in df.T.items()]
s = pd.Series(L, index=df.index)
Or convert values to lists by to_dict with parameter split:
L = df.to_dict(orient='split')['data']
print (L)
[[5, '', 9], ['', '', 1], [2, 2, 1]]
And then remove empty values:
s = pd.Series([[y for y in x if y != ''] for x in L], index=df.index)
print (s)
0       [5, 9]
1          [1]
2    [2, 2, 1]
dtype: object
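One more sketch of the same idea: map empty strings to NaN and let stack drop them (stack removes NaN by default), then rebuild one list per row with groupby. The assumption here is that no row is entirely empty, since such a row would vanish from the grouped result:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [5, '', 2], 'col2': ['', '', 2], 'col3': [9, 1, 1]})

# replace '' -> NaN, stack() silently drops the NaN cells,
# and groupby(level=0) collects the survivors per original row.
s = df.replace('', np.nan).stack().groupby(level=0).agg(list)
print(s)
# 0       [5, 9]
# 1          [1]
# 2    [2, 2, 1]
# dtype: object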

How to get the n most frequent values from each column in pandas

I know how to get the most frequent value of each column in a dataframe using "mode". For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
A
0 2
But I am unable to find the "n" most frequent values of each column of a dataframe. For example, for the mentioned dataframe, I would like the following output for n=2:
A
0 2
1 1
Any pointers?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
# A
# 0 2
# 1 1
Use value_counts and select the index values by indexing, but it works for each column separately, so apply or a dict comprehension with the DataFrame constructor is needed. Casting to Series is necessary for a more general solution, in case some indices do not exist, e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})
N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in df.columns})
print (df)
   A    B  C
0  2  1.0  d
1  1  NaN  e
For more general solution select only numeric columns first by select_dtypes:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})
N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
   A    B
0  2  1.0
1  1  NaN
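The same top-N selection can also be written with plain collections.Counter, which may read more simply when a dict of lists is enough (column names taken from the example above):
from collections import Counter

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1]})
N = 2

# most_common(N) returns up to N (value, count) pairs, ordered by count.
top = {col: [val for val, _ in Counter(df[col]).most_common(N)] for col in df}
print(top)
# {'A': [2, 1], 'B': [1]}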
