Example data:
import numpy as np
import pandas as pd

dictionary = {'col1': [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
              'col2': [[1, np.nan, 3], [1, 2, np.nan], [1, 2, 3]],
              'col3': [[1, 2, 3], [1, 2, 3], [1, np.nan, 3]]}
df = pd.DataFrame(dictionary)
I have a DataFrame of lists and I want to count the number of NaN values by row:
col1 col2 col3
[1,2,3] [1,NaN,3] [1,2,3]
[1,2,3] [1,2,NaN] [1,2,3]
[1,2,3] [1,2,3] [1,NaN,3]
I used
import itertools
acceptable_data_df.iloc[:, 1:].apply(lambda x: list(itertools.chain(*x)), axis=1)
to flatten each row into one list, hoping that would make it easier, but I'm still stuck. (The first column is text, hence the iloc[:, 1:].)
[1,2,3,1,NaN,3,1,2,3]
[1,2,3,1,2,NaN,1,2,3]
[1,2,3,1,2,3,1,NaN,3]
How can I do this?
You could stack + explode + isna to get a Series that is True for NaN and False otherwise. Then groupby + sum gives the number of NaN values per row:
df['number of NaN'] = df.stack().explode().isna().groupby(level=0).sum()
Output:
col1 col2 col3 number of NaN
0 [1, 2, 3] [1, nan, 3] [1, 2, 3] 1
1 [1, 2, 3] [1, 2, nan] [1, 2, 3] 1
2 [1, 2, 3] [1, 2, 3] [1, nan, 3] 1
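To see why this works, here is a quick sketch of the intermediate result, using the original three list columns (before the count column is added):
flat = df[['col1', 'col2', 'col3']].stack().explode()
print(flat.loc[0, 'col2'].tolist())   # [1, nan, 3] -- one scalar per list element
isna() turns those scalars into booleans, and groupby(level=0) groups them by the original row label, so sum() counts the NaNs per row.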
IIUC use:
df['count'] = (df.col1 + df.col2 + df.col3).apply(lambda x: pd.isna(x).sum())
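If the frame has more list columns than the three named here, a variant of the same idea (my sketch, not part of the answer above) sums over every list cell in the row:
list_cols = ['col1', 'col2', 'col3']  # extend with any further list columns
df['count'] = df[list_cols].apply(lambda row: sum(pd.isna(v) for cell in row for v in cell), axis=1)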
A "boring" way to count "NaN" while keeping the values as a string is to just count how many times the "NaN" string appears:
(df["col1"] + df["col2"] + df["col3"]).str.count("NaN")
I am very new to pandas. Can anybody tell me how to uniquely map the lists in a DataFrame to integers?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate, but both gave me errors.
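For reference, the sample column can be rebuilt like this (the column name Data is taken from the answers below):
import pandas as pd

df = pd.DataFrame({'Data': [['phone', 'laptop'],
                            ['life', 'death', 'mortal'],
                            ['happy']]})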
For efficiency, use a list comprehension.
For simple counts:
from itertools import count

c = count(1)  # running counter: 1, 2, 3, ...
df['new'] = [[next(c) for _ in l] for l in df['Data']]
For unique identifiers in case of duplicates:
from itertools import count

c = count(1)
d = {}  # item -> id, so duplicates reuse the id assigned on first sight
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l]
             for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
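To see the deduplication at work, here is a quick run with a repeated item (my own example data, plain lists for brevity):
from itertools import count

c = count(1)
d = {}
data = [['phone', 'laptop'], ['happy', 'phone']]
print([[d[x] if x in d else d.setdefault(x, next(c)) for x in l] for l in data])
# [[1, 2], [3, 1]] -- 'phone' keeps the id it was given first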
You could explode, assign a running index, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
["life", "death", "mortal"],
["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=lambda df: range(1, len(df) + 1))
.groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should have unique integer values, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=df_expl.data.factorize()[0] + 1)
.groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
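factorize assigns codes in order of first appearance across the whole exploded column, so duplicates share a code; the + 1 merely shifts the 0-based codes to start at 1, matching the counters in the other answer.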
Here's an example of what the output should look like:
DataFrame df with the required output:
class_id item req_output
a 1 [1]
a 2 [1,2]
a 3 [1,2,3]
b 1 [1]
b 2 [1,2]
I've tried:
df.groupby("class").apply(lambda x: list(x["item"])
class_id output
a [1,2,3]
b [1,2]
but this gives only the final aggregation per class; I need the cumulative list at every row.
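For reference, a minimal reconstruction of the sample frame the answers below assume:
import pandas as pd

df = pd.DataFrame({'class_id': ['a', 'a', 'a', 'b', 'b'],
                   'item': [1, 2, 3, 1, 2]})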
First, make each element into a list of size 1. Here we are exploiting (abusing?) the fact that [1] + [2] == [1, 2]. Then group by class_id and GroupBy.apply Series.cumsum.
df["req_output"] = (
df["item"]
.map(lambda x: [x])
.groupby(df["class_id"])
.apply(lambda x: x.cumsum())
)
class_id item req_output
0 a 1 [1]
1 a 2 [1, 2]
2 a 3 [1, 2, 3]
3 b 1 [1]
4 b 2 [1, 2]
Or we can make a function to return the desired list and use GroupBy.transform.
def get_slices(s):
"""
>>> get_slices(pd.Series([1, 2, 3]))
[[1], [1, 2], [1, 2, 3]]
"""
lst = s.tolist()
return [lst[:i] for i in range(1, len(lst)+1)]
df['req_output'] = df.groupby('class_id')['item'].transform(get_slices)
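transform works here because get_slices returns exactly one entry per row of the group, so pandas can align the result back to the original index.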
I have a column in a dataframe containing lists of strings as such:
id colname
1 ['str1', 'str2', 'str3']
2 ['str3', 'str4']
3 ['str3']
4 ['str2', 'str5', 'str6']
..
The strings in the list have some overlap as you can see.
I would like to append the lists in this column with a value, for example 'strX'. The end result should be:
id colname
1 ['str1', 'str2', 'str3', 'strX']
2 ['str3', 'str4', 'strX']
3 ['str3', 'strX']
4 ['str2', 'str5', 'str6', 'strX']
..
What would be the proper way to achieve this? I have tried appending and adding, but these don't get me the desired result.
So if you want to append "strx" to all of the lists, you can do it as @jezrael points out:
df = pd.DataFrame({"1": [[1,2,3], [3,4,5,6]]}, index=[1,2])
print(df)
1
1 [1, 2, 3]
2 [3, 4, 5, 6]
df.apply(lambda x: x["1"].append('strx'), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strx]
But if you want to add a different value based on the index, you can do that too!
Take the same df (rebuilt fresh, since the lists above were already mutated in place) with a dict that specifies what to append:
dico = {1: "strx", 2: "strj"}
df.apply(lambda x: x["1"].append(dico[x.name]), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strj]
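Note that list.append mutates each list in place and returns None, so both apply calls above are used purely for their side effect: the Series they return is all None, while df itself ends up holding the extended lists.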
You can add a list in Series.map or Series.apply:
df['colname'] = df['colname'].map(lambda x: x + ['strX'])
#alternative
#df['colname'] = df['colname'].apply(lambda x: x + ['strX'])
#or in list comprehension
#df['colname'] = [x + ['strX'] for x in df['colname']]
print (df)
id colname
0 1 [str1, str2, str3, strX]
1 2 [str3, str4, strX]
2 3 [str3, strX]
3 4 [str2, str5, str6, strX]
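Unlike the append approach above, x + ['strX'] builds a new list for every row and leaves the original list objects untouched, which is usually the safer choice when the same lists might be referenced elsewhere.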
Say I have the following dataframe:
df = pd.DataFrame({'col1': [5, '', 2], 'col2': ['', 1, 2], 'col3': [9, '', 1]})
print(df)
  col1 col2 col3
0    5         9
1         1
2    2    2    1
Is there a simple way to turn it into a pd.Series of lists, avoiding empty elements? So:
0 [5,9]
1 [1]
2 [2,2,1]
You can try using df.values: convert it into a list and remove the empty elements using map:
In [2193]: df
Out[2193]:
  col1 col2 col3
0    5         9
1         1
2    2    2    1
One-liner:
In [2186]: pd.Series(df.values.tolist()).map(lambda row: [x for x in row if x != ''])
Out[2186]:
0 [5, 9]
1 [1]
2 [2, 2, 1]
dtype: object
You can use this:
In[1]: [x[x.apply(lambda k: k != '')].tolist() for i, x in df.iterrows()]
Out[1]: [[5, 9], [1], [2, 2, 1]]
Similar to @jezrael's solution, but if you do not expect 0 values, you can use the inherent falsiness of empty strings:
L = [x[x.astype(bool)].tolist() for i, x in df.T.items()]
res = pd.Series(L, index=df.index)
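The caveat about 0 matters because astype(bool) turns a genuine 0 into False as well, so a row containing 0 would silently lose it; stick with the explicit != '' comparison if zeros can occur.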
Can be done as follows:
# Break down into a list of tuples (index=False keeps the index out of the rows)
records = df.to_records(index=False).tolist()
# Convert tuples into lists
series = pd.Series(records).map(list)
# Get rid of empty strings
series.map(lambda row: list(filter(lambda x: x != '', row)))
# ... alternatively
series.map(lambda row: [x for x in row if x != ''])
resulting in
0 [5, 9]
1 [1]
2 [2, 2, 1]
Use a list comprehension to remove the empty values:
L = [x[x != ''].tolist() for i, x in df.T.items()]
s = pd.Series(L, index=df.index)
Or convert the values to lists using to_dict with orient='split':
L = df.to_dict(orient='split')['data']
print (L)
[[5, '', 9], ['', 1, ''], [2, 2, 1]]
And then remove the empty values:
s = pd.Series([[y for y in x if y != ''] for x in L], index=df.index)
print (s)
0 [5, 9]
1 [1]
2 [2, 2, 1]
dtype: object
I know how to get the most frequent value of each column in a dataframe using mode. For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
A
0 2
But I am unable to find the n most frequent values of each column. For example, for the DataFrame above I would like the following output for n=2:
A
0 2
1 1
Any pointers?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
# A
# 0 2
# 1 1
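Note that this builds the DataFrame directly from the raw Index objects, so if one column has fewer than two distinct values the lengths differ and the constructor raises; the next answer handles that case by casting to Series.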
Use value_counts and select the top index values by indexing. It works on each column separately, so you need apply or a dict comprehension with the DataFrame constructor. Casting to Series is necessary for the more general solution where a column may have fewer than N distinct values (so some positions do not exist), e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
'B': [1, 1, 1, 1, 1, 1]})
N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in df.columns})
print (df)
   A    B
0  2  1.0
1  1  NaN
For a more general solution, select only numeric columns first with select_dtypes:
import numpy as np

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})
N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
   A    B
0  2  1.0
1  1  NaN