Pandas: Count values on a row basis - python

I have a numeric DataFrame, for example:
x = np.array([[1,2,3],[-1,-1,1],[0,0,0]])
df = pd.DataFrame(x, columns=['A','B','C'])
df
A B C
0 1 2 3
1 -1 -1 1
2 0 0 0
And I want to count, for each row, the number of positive values, negative values and values equal to 0. I've been trying the following:
df['positive_count'] = df.apply(lambda row: (row > 0).sum(), axis = 1)
df['negative_count'] = df.apply(lambda row: (row < 0).sum(), axis = 1)
df['zero_count'] = df.apply(lambda row: (row == 0).sum(), axis = 1)
But I'm getting the following result, which is obviously incorrect:
A B C positive_count negative_count zero_count
0 1 2 3 3 0 1
1 -1 -1 1 1 2 0
2 0 0 0 0 0 5
Anyone knows what might be going wrong, or could help me find the best way to do what I'm looking for?
Thank you.

There are some ways, but one option is using np.sign and get_dummies:
u = (pd.get_dummies(np.sign(df.stack()))
       .sum(level=0)
       .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
u
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
df = pd.concat([df, u], axis=1)
df
A B C negative_count zero_count positive_count
0 1 2 3 0 0 3
1 -1 -1 1 2 0 1
2 0 0 0 0 3 0
np.sign treats zero differently from positive and negative values, so it is ideal to use here.
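Incidentally, this also explains what went wrong in the original attempt: each apply call sees every column present at call time, so negative_count also scans the freshly added positive_count column, and zero_count scans both count columns (hence the stray 1 and the 5). A minimal fix is to restrict the masks to the original columns, which also lets you drop apply entirely:
cols = ['A', 'B', 'C']
df['positive_count'] = (df[cols] > 0).sum(axis=1)
df['negative_count'] = (df[cols] < 0).sum(axis=1)
df['zero_count'] = (df[cols] == 0).sum(axis=1)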
Another option is groupby and value_counts:
(np.sign(df)
   .stack()
   .groupby(level=0)
   .value_counts()
   .unstack(1, fill_value=0)
   .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))
negative_count zero_count positive_count
0 0 0 3
1 2 0 1
2 0 3 0
Slightly more verbose but still worth knowing about.
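One caveat: sum(level=0) was deprecated and later removed in pandas 2.0, so on a recent version the first option needs an explicit groupby, roughly:
u = (pd.get_dummies(np.sign(df.stack()))
       .groupby(level=0).sum()
       .rename({-1: 'negative_count', 1: 'positive_count', 0: 'zero_count'}, axis=1))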

Related

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix that has as header all the unique values from column Data, meaning [a,b,c,d,e]. Then, as rows, a flag indicating whether the value is present in that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To separate column Data what I did is:
df['data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = "\s+")
We can use pandas.Series.str.split with the expand argument set to True, and then apply value_counts to each row with axis = 1.
Finally, fillna with zero and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
If you split the strings into lists and then explode them, a pivot becomes possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
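Note that count does plain substring counting, which coincides with membership here only because the keys are single characters that appear at most once per row. A vectorised sketch of the same idea, without apply:
for k in ["a", "b", "c", "d", "e"]:
    df[k] = df["Data"].str.count(k)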

Concatenate column names by using the binary values in the columns

Currently, I have a dataframe as follows:
date A B C
02/19/2020 0 0 0
02/20/2020 0 0 0
02/21/2020 1 1 1
02/22/2020 0 1 0
02/23/2020 0 1 1
02/24/2020 0 0 1
02/25/2020 1 0 1
02/26/2020 1 0 0
The binary columns contain integers. The "date" column is a DateTime object. I want to create a new categorical column that is based on the binary columns as follows
date A B C new
02/19/2020 0 0 0 "None"
02/20/2020 0 0 0 "None"
02/21/2020 1 1 1 A+B+C
02/22/2020 0 1 0 B
02/23/2020 0 1 1 B+C
02/24/2020 0 0 1 C
02/25/2020 1 0 1 A+C
02/26/2020 1 0 0 A
How can I achieve this?
Use DataFrame.dot for matrix multiplication with the column names, omitting the first column by position with DataFrame.iloc; append a separator to each column name and finally remove the trailing separator by indexing with str[:-1]:
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1]
#set empty string to None
df.loc[df['new'].eq(''), 'new'] = None
print (df)
date A B C new
0 02/19/2020 0 0 0 None
1 02/20/2020 0 0 0 None
2 02/21/2020 1 1 1 A+B+C
3 02/22/2020 0 1 0 B
4 02/23/2020 0 1 1 B+C
5 02/24/2020 0 0 1 C
6 02/25/2020 1 0 1 A+C
7 02/26/2020 1 0 0 A
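To see what the final str[:-1] cleans up, inspect the intermediate dot product (on the original frame, before 'new' is added):
raw = df.iloc[:, 1:].dot(df.columns[1:] + '+')
print(raw[2])  # 'A+B+C+' - the trailing '+' is then sliced off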
If possible, use NaN instead of None:
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1].replace('', np.nan)
print (df)
date A B C new
0 02/19/2020 0 0 0 NaN
1 02/20/2020 0 0 0 NaN
2 02/21/2020 1 1 1 A+B+C
3 02/22/2020 0 1 0 B
4 02/23/2020 0 1 1 B+C
5 02/24/2020 0 0 1 C
6 02/25/2020 1 0 1 A+C
7 02/26/2020 1 0 0 A
Or, if you can set the first column as a DatetimeIndex, use:
df1 = df.set_index('date')
df1['new'] = df1.dot(df1.columns + '+').str[:-1]
df1.loc[df1['new'].eq(''), 'new'] = None
You can iterate over the DataFrame to calculate the new column's values and then add it.
This is a basic example:
new_column = []
for i, row in df.iterrows():
    row_val = None
    if row["A"]:
        if row_val:
            row_val += "+A"
        else:
            row_val = "A"
    if row["B"]:
        if row_val:
            row_val += "+B"
        else:
            row_val = "B"
    if row["C"]:
        if row_val:
            row_val += "+C"
        else:
            row_val = "C"
    if row_val is None:
        row_val = "None"
    new_column.append(row_val)
df["new_column_name"] = new_column

Split columns with mixed integers and tuples to multiple columns

I have a df
a b
0 (0,1) 1
1 1 (1,2)
2 2 3
The desired output is:
w x y z
0 0 1 1 0
1 1 0 1 2
2 2 0 3 3
The problem is that the tuples can have multiple different lengths.
The following tolist() only works for tuples of length 2 and not for mixed columns.
df[['w', 'x']]=pd.DataFrame(df['a'].tolist(), index=df.index)
Any ideas?
Thanks in advance.
The idea is to wrap scalars into tuples and then create the new columns:
def f(col):
    return pd.DataFrame([x if isinstance(x, tuple) else (x, )
                         for x in col]).fillna(0).astype(int)

df[['w', 'x']] = df.pop('a').pipe(f)
df[['y', 'z']] = df.pop('b').pipe(f)
print (df)
w x y z
0 0 1 1 0
1 1 0 1 2
2 2 0 3 0
More general solution with concat:
dfs = [pd.DataFrame([x if isinstance(x, tuple) else (x, ) for x in df.pop(c)],
                    index=df.index) for c in df.columns]
df = pd.concat(dfs, axis=1, ignore_index=True).fillna(0).astype(int)
print (df)
0 1 2 3
0 0 1 1 0
1 1 0 1 2
2 2 0 3 0
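If you want the letter names back instead of the positional 0-3 labels, you can rename afterwards (assuming, as here, exactly four result columns):
df.columns = ['w', 'x', 'y', 'z']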
You can convert to str, then strip the parentheses and split on the commas:
>>> df[['w', 'x']] = pd.DataFrame(df.pop('a')
...                                 .astype(str)
...                                 .str.strip('()')
...                                 .str.split(',')
...                                 .tolist()).fillna(0).astype(int)
>>> df[['y', 'z']] = pd.DataFrame(df.pop('b')
...                                 .astype(str)
...                                 .str.strip('()')
...                                 .str.split(',')
...                                 .tolist()).fillna(0).astype(int)
>>> df
w x y z
0 0 1 1 0
1 1 0 1 2
2 2 0 3 0

Count the frequency of list element in a row grouped by Date and tag

I have a dataframe df which looks like this:
ID Date Input
1 1-Nov A,B
1 2-NOV A
2 3-NOV A,B,C
2 4-NOV B,D
I want my output to count the occurrences of each input cumulatively while it appears on consecutive dates, otherwise resetting to zero again (counting only within the same ID). Also, the output columns should be renamed to X.A, X.B, X.C and X.D, so my output will look like this:
ID Date Input X.A X.B X.C X.D
1 1-NOV A,B 1 1 0 0
1 2-NOV A 2 0 0 0
2 3-NOV A,B,C 1 1 1 0
2 4-NOV B,D 0 2 0 1
How can I create the output columns (A, B, C and D) that count the input occurrences date- and ID-wise?
Use Series.str.get_dummies for the indicator columns and then count consecutive 1s per group - use GroupBy.cumsum and subtract the count forward-filled from the gaps with GroupBy.ffill, change the column names with DataFrame.add_prefix, and finally DataFrame.join back to the original:
a = df['Input'].str.get_dummies(',') == 1
b = a.groupby(df.ID).cumsum().astype(int)
df1 = (b-b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int)).add_prefix('X.')
df = df.join(df1)
print (df)
ID Date Input X.A X.B X.C X.D
0 1 1-Nov A,B 1 1 0 0
1 1 2-NOV A 2 0 0 0
2 2 3-NOV A,B,C 1 1 1 0
3 2 4-NOV B,D 0 2 0 1
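The middle line is dense, so a quick sketch of the mechanics may help: b.mask(a) keeps the running count only at the gaps where a label is absent, forward-filling it per ID carries the count accumulated before the most recent gap, and subtracting that from b performs the reset.
a = df['Input'].str.get_dummies(',') == 1
b = a.groupby(df.ID).cumsum().astype(int)
print(b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int))  # count carried from before the last gap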
First add the counts as new columns and then use groupby to make a cumulative sum.
# find which columns to add
cols = set([l for sublist in df['Input'].apply(lambda x: x.split(',')).values for l in sublist])
# add the new columns
for col in cols:
    df['X.' + col] = df['Input'].apply(lambda x: int(col in x))
# group by and add a cumulative sum conditional on the value being positive
group = df.groupby('ID')
for col in cols:
    df['X.' + col] = group['X.' + col].apply(lambda x: np.cumsum(x) * (x > 0).astype(int))
The result is then
print(df)
ID Date Input X.C X.D X.A X.B
0 1 1-NOV A,B 0 0 1 1
1 1 2-NOV A 0 0 2 0
2 2 3-NOV A,B,C 1 0 1 1
3 2 4-NOV B,D 0 1 0 2

Compare two columns using pandas 2

I'm comparing two columns in a dataframe (A & B). I have a method that works (C5). It came from this question:
Compare two columns using pandas
I wondered why I couldn't get the other methods (C1 - C4) to give the correct answer:
df = pd.DataFrame({'A': [1,1,1,1,1,2,2,2,2,2],
                   'B': [1,1,1,1,1,1,0,0,0,0]})
#df['C1'] = 1 [df['A'] == df['B']]
df['C2'] = df['A'].equals(df['B'])
df['C3'] = np.where((df['A'] == df['B']),0,1)
def fun(row):
    if ['A'] == ['B']:
        return 1
    else:
        return 0
df['C4'] = df.apply(fun, axis=1)
df['C5'] = df.apply(lambda x : 1 if x['A'] == x['B'] else 0, axis=1)
Use:
df = pd.DataFrame({'A': [1,1,1,1,1,2,2,2,2,2],
                   'B': [1,1,1,1,1,1,0,0,0,0]})
For C1 and C2 you need to compare the columns with == or eq for a boolean mask and then convert it to integers (True/False to 1/0) - note that the original C2 used Series.equals, which returns a single boolean for the whole Series, so every row got the same value:
df['C1'] = (df['A'] == df['B']).astype(int)
df['C2'] = df['A'].eq(df['B']).astype(int)
For C3 it is necessary to swap the 1 and 0 - the matched condition should return 1:
df['C3'] = np.where((df['A'] == df['B']),1,0)
In the function the values were not selected from the row Series - the row reference is missing, so ['A'] == ['B'] compares two literal lists and is always False:
def fun(row):
    if row['A'] == row['B']:
        return 1
    else:
        return 0

df['C4'] = df.apply(fun, axis=1)
The C5 solution is correct:
df['C5'] = df.apply(lambda x : 1 if x['A'] == x['B'] else 0, axis=1)
print (df)
A B C1 C2 C3 C4 C5
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1
4 1 1 1 1 1 1 1
5 2 1 0 0 0 0 0
6 2 0 0 0 0 0 0
7 2 0 0 0 0 0 0
8 2 0 0 0 0 0 0
9 2 0 0 0 0 0 0
IIUC you need this:
def fun(row):
    if row['A'] == row['B']:
        return 1
    else:
        return 0
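and then apply it row-wise exactly as in the question:
df['C4'] = df.apply(fun, axis=1)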
