Example data:
import numpy as np
import pandas as pd

dictionary = {'col1': [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
              'col2': [[1, np.nan, 3], [1, 2, np.nan], [1, 2, 3]],
              'col3': [[1, 2, 3], [1, 2, 3], [1, np.nan, 3]]}
df = pd.DataFrame(dictionary)
I have a DataFrame of lists and I want to count the number of NaN values by row:
col1 col2 col3
[1,2,3] [1,NaN,3] [1,2,3]
[1,2,3] [1,2,NaN] [1,2,3]
[1,2,3] [1,2,3] [1,NaN,3]
I used
import itertools
acceptable_data_df.iloc[:, 1:].apply(lambda x: list(itertools.chain(*x)), axis=1)
to flatten each row into one list, hoping that would make it easier, but I'm still stuck. (The first column is text, hence the iloc[:, 1:].)
[1,2,3,1,NaN,3,1,2,3]
[1,2,3,1,2,NaN,1,2,3]
[1,2,3,1,2,3,1,NaN,3]
How can I do this?
You could stack + explode + isna to get a Series that is True for NaN and False otherwise. Then groupby + sum gives the number of NaN values per row:
df['number of NaN'] = df.stack().explode().isna().groupby(level=0).sum()
Output:
col1 col2 col3 number of NaN
0 [1, 2, 3] [1, nan, 3] [1, 2, 3] 1
1 [1, 2, 3] [1, 2, nan] [1, 2, 3] 1
2 [1, 2, 3] [1, 2, 3] [1, nan, 3] 1
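To see why this works, here is a quick sketch of the intermediate result, using the original three list columns (before the count column is added):
flat = df[['col1', 'col2', 'col3']].stack().explode()
print(flat.loc[0, 'col2'].tolist())   # [1, nan, 3] -- one scalar per list element
isna() turns those scalars into booleans, and groupby(level=0) groups them by the original row label, so sum() counts the NaNs per row.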
IIUC use:
df['count'] = (df.col1 + df.col2 + df.col3).apply(lambda x: pd.isna(x).sum())
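If the frame has more list columns than the three named here, a variant of the same idea (my sketch, not part of the answer above) sums over every list cell in the row:
list_cols = ['col1', 'col2', 'col3']  # extend with any further list columns
df['count'] = df[list_cols].apply(lambda row: sum(pd.isna(v) for cell in row for v in cell), axis=1)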
A "boring" way to count "NaN" while keeping the values as a string is to just count how many times the "NaN" string appears:
(df["col1"] + df["col2"] + df["col3"]).str.count("NaN")
I am very new to pandas. Can anybody tell me how to uniquely map the lists in a DataFrame to integers?
Data
[phone, laptop]
[life, death, mortal]
[happy]
Expected output:
[1,2]
[3,4,5]
[6]
I used map() and enumerate, but both gave me errors.
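For reference, the sample column can be rebuilt like this (the column name Data is taken from the answers below):
import pandas as pd

df = pd.DataFrame({'Data': [['phone', 'laptop'],
                            ['life', 'death', 'mortal'],
                            ['happy']]})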
For efficiency, use a list comprehension.
For simple counts:
from itertools import count

c = count(1)  # running counter: 1, 2, 3, ...
df['new'] = [[next(c) for _ in l] for l in df['Data']]
For unique identifiers in case of duplicates:
from itertools import count

c = count(1)
d = {}  # item -> id, so duplicates reuse the id assigned on first sight
df['new'] = [[d[x] if x in d else d.setdefault(x, next(c)) for x in l]
             for l in df['Data']]
Output:
Data new
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy] [6]
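To see the deduplication at work, here is a quick run with a repeated item (my own example data, plain lists for brevity):
from itertools import count

c = count(1)
d = {}
data = [['phone', 'laptop'], ['happy', 'phone']]
print([[d[x] if x in d else d.setdefault(x, next(c)) for x in l] for l in data])
# [[1, 2], [3, 1]] -- 'phone' keeps the id it was given first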
You could explode, assign a running index, and groupby to undo the explode operation:
df = pd.DataFrame({"data": [["phone", "laptop"],
["life", "death", "mortal"],
["happy", "phone"]]})
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=lambda df: range(1, len(df) + 1))
.groupby(df_expl.index).data.apply(list))
print(df)
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 7]
This always increments the counter, even if list items are duplicates.
In case duplicates should have unique integer values, use factorize instead of range:
df_expl = df.explode("data")
df["data_mapped"] = (
df_expl.assign(data=df_expl.data.factorize()[0] + 1)
.groupby(df_expl.index).data.apply(list))
print(df)
# output:
data data_mapped
0 [phone, laptop] [1, 2]
1 [life, death, mortal] [3, 4, 5]
2 [happy, phone] [6, 1]
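factorize assigns codes in order of first appearance across the whole exploded column, so duplicates share a code; the + 1 merely shifts the 0-based codes to start at 1, matching the counters in the other answer.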
Here's an example of what the output should look like:
DataFrame df with the required output:
class_id item req_output
a 1 [1]
a 2 [1,2]
a 3 [1,2,3]
b 1 [1]
b 2 [1,2]
I've tried:
df.groupby("class").apply(lambda x: list(x["item"])
class_id output
a [1,2,3]
b [1,2]
but this gives only the final aggregation per class; I need the cumulative list at every row.
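For reference, a minimal reconstruction of the sample frame the answers below assume:
import pandas as pd

df = pd.DataFrame({'class_id': ['a', 'a', 'a', 'b', 'b'],
                   'item': [1, 2, 3, 1, 2]})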
First, make each element into a list of size 1. Here we are exploiting (abusing?) the fact that [1] + [2] == [1, 2]. Then group by class_id and GroupBy.apply Series.cumsum.
df["req_output"] = (
df["item"]
.map(lambda x: [x])
.groupby(df["class_id"])
.apply(lambda x: x.cumsum())
)
class_id item req_output
0 a 1 [1]
1 a 2 [1, 2]
2 a 3 [1, 2, 3]
3 b 1 [1]
4 b 2 [1, 2]
Or we can make a function to return the desired list and use GroupBy.transform.
def get_slices(s):
"""
>>> get_slices(pd.Series([1, 2, 3]))
[[1], [1, 2], [1, 2, 3]]
"""
lst = s.tolist()
return [lst[:i] for i in range(1, len(lst)+1)]
df['req_output'] = df.groupby('class_id')['item'].transform(get_slices)
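transform works here because get_slices returns exactly one entry per row of the group, so pandas can align the result back to the original index.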
I have a column in a dataframe containing lists of strings as such:
id colname
1 ['str1', 'str2', 'str3']
2 ['str3', 'str4']
3 ['str3']
4 ['str2', 'str5', 'str6']
..
The strings in the list have some overlap as you can see.
I would like to append the lists in this column with a value, for example 'strX'. The end result should be:
id colname
1 ['str1', 'str2', 'str3', 'strX']
2 ['str3', 'str4', 'strX']
3 ['str3', 'strX']
4 ['str2', 'str5', 'str6', 'strX']
..
What would be the proper way to achieve this? I have tried appending and adding, but these don't get me the desired result.
So if you want to append "strx" to all of the lists, you can do it as @jezrael points out:
df = pd.DataFrame({"1": [[1,2,3], [3,4,5,6]]}, index=[1,2])
print(df)
1
1 [1, 2, 3]
2 [3, 4, 5, 6]
df.apply(lambda x: x["1"].append('strx'), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strx]
But if you want to add a different value based on the index, you can do that too!
Take the same df (rebuilt fresh, since the lists above were already mutated in place) with a dict that specifies what to append:
dico = {1: "strx", 2: "strj"}
df.apply(lambda x: x["1"].append(dico[x.name]), axis=1)
print(df)
1
1 [1, 2, 3, strx]
2 [3, 4, 5, 6, strj]
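Note that list.append mutates each list in place and returns None, so both apply calls above are used purely for their side effect: the Series they return is all None, while df itself ends up holding the extended lists.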
You can add a list in Series.map or Series.apply:
df['colname'] = df['colname'].map(lambda x: x + ['strX'])
#alternative
#df['colname'] = df['colname'].apply(lambda x: x + ['strX'])
#or in list comprehension
#df['colname'] = [x + ['strX'] for x in df['colname']]
print (df)
id colname
0 1 [str1, str2, str3, strX]
1 2 [str3, str4, strX]
2 3 [str3, strX]
3 4 [str2, str5, str6, strX]
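Unlike the append approach above, x + ['strX'] builds a new list for every row and leaves the original list objects untouched, which is usually the safer choice when the same lists might be referenced elsewhere.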
Say I have the following dataframe:
df = pd.DataFrame({'col1': [5, '', 2], 'col2': ['', 1, 2], 'col3': [9, '', 1]})
print(df)
  col1 col2 col3
0    5         9
1         1
2    2    2    1
Is there a simple way to turn it into a pd.Series of lists, avoiding empty elements? So:
0 [5,9]
1 [1]
2 [2,2,1]
You can try using df.values: convert it into a list and remove the empty elements using map:
In [2193]: df
Out[2193]:
  col1 col2 col3
0    5         9
1         1
2    2    2    1
One-liner:
In [2186]: pd.Series(df.values.tolist()).map(lambda row: [x for x in row if x != ''])
Out[2186]:
0 [5, 9]
1 [1]
2 [2, 2, 1]
dtype: object
You can use this:
In[1]: [x[x.apply(lambda k: k != '')].tolist() for i, x in df.iterrows()]
Out[1]: [[5, 9], [1], [2, 2, 1]]
Similar to @jezrael's solution, but if you do not expect 0 values, you can use the inherent falsiness of empty strings:
L = [x[x.astype(bool)].tolist() for i, x in df.T.items()]
res = pd.Series(L, index=df.index)
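The caveat about 0 matters because astype(bool) turns a genuine 0 into False as well, so a row containing 0 would silently lose it; stick with the explicit != '' comparison if zeros can occur.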
Can be done as follows:
# Break down into a list of tuples (index=False keeps the index out of the rows)
records = df.to_records(index=False).tolist()
# Convert tuples into lists
series = pd.Series(records).map(list)
# Get rid of empty strings
series.map(lambda row: list(filter(lambda x: x != '', row)))
# ... alternatively
series.map(lambda row: [x for x in row if x != ''])
resulting in
0 [5, 9]
1 [1]
2 [2, 2, 1]
Use a list comprehension to remove the empty values:
L = [x[x != ''].tolist() for i, x in df.T.items()]
s = pd.Series(L, index=df.index)
Or convert the values to lists using to_dict with orient='split':
L = df.to_dict(orient='split')['data']
print (L)
[[5, '', 9], ['', 1, ''], [2, 2, 1]]
And then remove the empty values:
s = pd.Series([[y for y in x if y != ''] for x in L], index=df.index)
print (s)
0 [5, 9]
1 [1]
2 [2, 2, 1]
dtype: object
I know how to get the most frequent value of each column in a dataframe using mode. For example:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
df.mode()
A
0 2
But I am unable to find the n most frequent values of each column. For example, for the DataFrame above I would like the following output for n=2:
A
0 2
1 1
Any pointers?
One way is to use pd.Series.value_counts and extract the index:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3]})
res = pd.DataFrame({col: df[col].value_counts().head(2).index for col in df})
# A
# 0 2
# 1 1
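Note that this builds the DataFrame directly from the raw Index objects, so if one column has fewer than two distinct values the lengths differ and the constructor raises; the next answer handles that case by casting to Series.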
Use value_counts and select the top index values by indexing. It works on each column separately, so you need apply or a dict comprehension with the DataFrame constructor. Casting to Series is necessary for the more general solution where a column may have fewer than N distinct values (so some positions do not exist), e.g.:
df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
'B': [1, 1, 1, 1, 1, 1]})
N = 2
df = df.apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in df.columns})
print (df)
   A    B
0  2  1.0
1  1  NaN
For a more general solution, select only numeric columns first with select_dtypes:
import numpy as np

df = pd.DataFrame({'A': [1, 2, 1, 2, 2, 3],
                   'B': [1, 1, 1, 1, 1, 1],
                   'C': list('abcdef')})
N = 2
df = df.select_dtypes([np.number]).apply(lambda x: pd.Series(x.value_counts().index[:N]))
Or:
N = 2
cols = df.select_dtypes([np.number]).columns
df = pd.DataFrame({x: pd.Series(df[x].value_counts().index[:N]) for x in cols})
print (df)
   A    B
0  2  1.0
1  1  NaN