Sum the values of each row across columns in pandas - python

This is my dataframe. I want to sum the values in each row across columns A, B, C and D, and store the result in a new column 'Summ':
A B C D Summ
0 1 1 0 0 1+1+0+0
1 0 0 1 1 0+0+1+1
2 0 0 1 0 0+0+1+0
3 1 1 1 1 1+1+1+1
4 1 0 1 0 1+0+1+0

df['Summ'] = df.sum(axis=1)
or, with .loc:
df.loc[:, 'Summ'] = df.sum(axis=1)
or, to sum only a subset of columns:
cols = ['A','B']
df.loc[:, 'Summ'] = df[cols].sum(axis=1)
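As a minimal sketch on the frame from the question (the way df is built below is an assumption about how the data is loaded), selecting the columns explicitly keeps the total correct even when the line is run again after 'Summ' already exists, since df.sum(axis=1) on the whole frame would then add 'Summ' into itself:

```python
import pandas as pd

# Assumed reconstruction of the question's data
df = pd.DataFrame({'A': [1, 0, 0, 1, 1],
                   'B': [1, 0, 0, 1, 0],
                   'C': [0, 1, 1, 1, 1],
                   'D': [0, 1, 0, 1, 0]})

cols = ['A', 'B', 'C', 'D']
df['Summ'] = df[cols].sum(axis=1)
# Running the same line again gives the same result, because 'Summ' is not in cols
df['Summ'] = df[cols].sum(axis=1)
print(df)
```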

Related

How to split comma separated text into columns on pandas dataframe?

I have a dataframe where one of the columns has its items separated with commas. It looks like:
Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e
My goal is to create a matrix whose header contains all the unique values from column Data, meaning [a,b,c,d,e]. Each row should then hold a flag indicating whether the value is present in that particular row.
The matrix should look like this:
Data       a  b  c  d  e
a,b,c      1  1  1  0  0
a,c,d      1  0  1  1  0
d,e        0  0  0  1  1
a,e        1  0  0  0  1
a,b,c,d,e  1  1  1  1  1
To split column Data, what I did is:
df['Data'].str.split(',', expand = True)
Then I don't know how to proceed to allocate the flags to each of the columns.
Maybe you can try this without pivot.
Create the dataframe.
import pandas as pd
import io
s = '''Data
a,b,c
a,c,d
d,e
a,e
a,b,c,d,e'''
df = pd.read_csv(io.StringIO(s), sep = r"\s+")
We can use pandas.Series.str.split with the expand argument set to True, and then apply value_counts to each row with axis = 1.
Finally, fill the NaNs with zero using fillna and convert the data to integers with astype(int).
df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
#
a b c d e
0 1 1 1 0 0
1 1 0 1 1 0
2 0 0 0 1 1
3 1 0 0 0 1
4 1 1 1 1 1
And then merge it with the original column.
new = df["Data"].str.split(pat = ",", expand=True).apply(lambda x : x.value_counts(), axis = 1).fillna(0).astype(int)
pd.concat([df, new], axis = 1)
#
Data a b c d e
0 a,b,c 1 1 1 0 0
1 a,c,d 1 0 1 1 0
2 d,e 0 0 0 1 1
3 a,e 1 0 0 0 1
4 a,b,c,d,e 1 1 1 1 1
Use the Series.str.get_dummies() method to return the required matrix of 'a', 'b', ... 'e' columns.
df["Data"].str.get_dummies(sep=',')
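A short sketch of this on the question's data, assuming the frame is built directly rather than read from text; joining the flags back onto df reproduces the layout asked for:

```python
import pandas as pd

df = pd.DataFrame({'Data': ['a,b,c', 'a,c,d', 'd,e', 'a,e', 'a,b,c,d,e']})

# One indicator column per unique comma-separated value
flags = df['Data'].str.get_dummies(sep=',')
# Keep the original column next to its flags
result = df.join(flags)
print(result)
```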
If you split the strings into lists, then explode them, it makes pivot possible.
(df.assign(data_list=df.Data.str.split(','))
   .explode('data_list')
   .pivot_table(index='Data',
                columns='data_list',
                aggfunc=lambda x: 1,
                fill_value=0))
Output
data_list a b c d e
Data
a,b,c 1 1 1 0 0
a,b,c,d,e 1 1 1 1 1
a,c,d 1 0 1 1 0
a,e 1 0 0 0 1
d,e 0 0 0 1 1
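If the pivot_table call behaves differently on your pandas version, one possible alternative (a sketch, not the answerer's method) is pd.crosstab on the exploded frame. It counts each (Data, value) pair, so clip(upper=1) is added here on the assumption that duplicate Data strings should still produce 0/1 flags:

```python
import pandas as pd

df = pd.DataFrame({'Data': ['a,b,c', 'a,c,d', 'd,e', 'a,e', 'a,b,c,d,e']})

# One row per (original string, single value)
exploded = df.assign(data_list=df['Data'].str.split(',')).explode('data_list')
# Cross-tabulate strings against values; clip turns counts into flags
flags = pd.crosstab(exploded['Data'], exploded['data_list']).clip(upper=1)
print(flags)
```

As with the pivot above, the rows come back sorted by the Data string rather than in the original order.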
You could apply a custom count function for each key:
for k in ["a","b","c","d","e"]:
    df[k] = df.apply(lambda row: row["Data"].count(k), axis=1)
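One caveat with str.count is that it matches substrings, which is harmless for single letters but would miscount if one key were contained in another. A hedged sketch of an exact-match variant, using hypothetical multi-character keys that are not in the question:

```python
import pandas as pd

# Hypothetical data: 'a' is a substring of the key 'ab'
df = pd.DataFrame({'Data': ['ab,c', 'a,c', 'ab']})
keys = ['a', 'ab', 'c']

tokens = df['Data'].str.split(',')
for k in keys:
    # Exact membership in the split list, not a substring match
    df[k] = tokens.apply(lambda parts: int(k in parts))
print(df)
```

Here 'ab,c'.count('a') would return 1 even though 'a' is not one of the items in that row.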

Concatenate column names by using the binary values in the columns

Currently, I have a dataframe as follows:
date A B C
02/19/2020 0 0 0
02/20/2020 0 0 0
02/21/2020 1 1 1
02/22/2020 0 1 0
02/23/2020 0 1 1
02/24/2020 0 0 1
02/25/2020 1 0 1
02/26/2020 1 0 0
The binary columns contain integers. The "date" column is a DateTime object. I want to create a new categorical column that is based on the binary columns as follows
date A B C new
02/19/2020 0 0 0 "None"
02/20/2020 0 0 0 "None"
02/21/2020 1 1 1 A+B+C
02/22/2020 0 1 0 B
02/23/2020 0 1 1 B+C
02/24/2020 0 0 1 C
02/25/2020 1 0 1 A+C
02/26/2020 1 0 0 A
How can I achieve this?
Use DataFrame.dot for matrix multiplication with the column names: select every column except the first by position with DataFrame.iloc, append the separator to each of those column names, and remove the trailing separator by indexing with str[:-1]:
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1]
#set empty string to None
df.loc[df['new'].eq(''), 'new'] = None
print (df)
date A B C new
0 02/19/2020 0 0 0 None
1 02/20/2020 0 0 0 None
2 02/21/2020 1 1 1 A+B+C
3 02/22/2020 0 1 0 B
4 02/23/2020 0 1 1 B+C
5 02/24/2020 0 0 1 C
6 02/25/2020 1 0 1 A+C
7 02/26/2020 1 0 0 A
If NaNs are acceptable instead of Nones (this needs import numpy as np):
df['new'] = df.iloc[:, 1:].dot(df.columns[1:] + '+').str[:-1].replace('', np.nan)
print (df)
date A B C new
0 02/19/2020 0 0 0 NaN
1 02/20/2020 0 0 0 NaN
2 02/21/2020 1 1 1 A+B+C
3 02/22/2020 0 1 0 B
4 02/23/2020 0 1 1 B+C
5 02/24/2020 0 0 1 C
6 02/25/2020 1 0 1 A+C
7 02/26/2020 1 0 0 A
Or, if the first column can be moved into the index (a DatetimeIndex here), use:
df1 = df.set_index('date')
df1['new'] = df1.dot(df1.columns + '+').str[:-1]
df1.loc[df1['new'].eq(''), 'new'] = None
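To sanity-check the dot-product result, a slower but more explicit equivalent can be sketched with a row-wise join. The helper name label and the list flag_cols are illustrative, and the literal string 'None' is used for empty rows as in the question's desired output:

```python
import pandas as pd

# A subset of the question's rows
df = pd.DataFrame({
    'date': pd.to_datetime(['02/19/2020', '02/21/2020', '02/22/2020',
                            '02/23/2020', '02/25/2020']),
    'A': [0, 1, 0, 0, 1],
    'B': [0, 1, 1, 1, 0],
    'C': [0, 1, 0, 1, 1],
})
flag_cols = ['A', 'B', 'C']

def label(row):
    # Names of the columns whose flag is set, joined with '+'
    names = [c for c in flag_cols if row[c] == 1]
    return '+'.join(names) if names else 'None'

df['new'] = df[flag_cols].apply(label, axis=1)
print(df)
```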
You can iterate over the DataFrame to calculate the new column's values and then add the column.
This is a basic example:
new_column = []
for i, row in df.iterrows():
    row_val = None
    if row["A"]:
        if row_val:
            row_val += "+A"
        else:
            row_val = "A"
    if row["B"]:
        if row_val:
            row_val += "+B"
        else:
            row_val = "B"
    if row["C"]:
        if row_val:
            row_val += "+C"
        else:
            row_val = "C"
    if row_val is None:
        row_val = "None"
    new_column.append(row_val)
df["new_column_name"] = new_column

Count the frequency of list element in a row grouped by Date and tag

I have a dataframe df which looks like this:
ID Date Input
1 1-Nov A,B
1 2-NOV A
2 3-NOV A,B,C
2 4-NOV B,D
I want my output to count the occurrences of each input value as long as they are consecutive, and otherwise reset the count to zero (counting only within the same ID). The output columns should be named X.A, X.B, X.C and X.D, so my output will look like this:
ID Date Input X.A X.B X.C X.D
1 1-NOV A,B 1 1 0 0
1 2-NOV A 2 0 0 0
2 3-NOV A,B,C 1 1 1 0
2 4-NOV B,D 0 2 0 1
How can I create the output columns (A, B, C and D) that count the input occurrences by date and ID?
Use Series.str.get_dummies to build indicator columns, then count consecutive 1s per group: take GroupBy.cumsum and subtract the running total at the last row where the value was absent, forward-filled with GroupBy.ffill. Rename the columns with DataFrame.add_prefix and finally DataFrame.join the result to the original:
a = df['Input'].str.get_dummies(',') == 1
b = a.groupby(df.ID).cumsum().astype(int)
df1 = (b-b.mask(a).groupby(df.ID).ffill().fillna(0).astype(int)).add_prefix('X.')
df = df.join(df1)
print (df)
ID Date Input X.A X.B X.C X.D
0 1 1-Nov A,B 1 1 0 0
1 1 2-NOV A 2 0 0 0
2 2 3-NOV A,B,C 1 1 1 0
3 2 4-NOV B,D 0 2 0 1
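The sample data never shows a value disappearing and then returning, so the reset is easier to see on a small hypothetical frame. The sketch below follows the same steps with named intermediates (ind, last_gap and streak are illustrative names):

```python
import pandas as pd

# Hypothetical data: A is present, present, absent, present within one ID
df = pd.DataFrame({'ID': [1, 1, 1, 1],
                   'Input': ['A', 'A', 'B', 'A']})

ind = df['Input'].str.get_dummies(',').astype(int)  # 0/1 indicator per value
a = ind.astype(bool)                                # True where the value is present
b = ind.groupby(df['ID']).cumsum()                  # running total per ID
# Running total at the most recent row where the value was absent (0 before any gap)
last_gap = b.mask(a).groupby(df['ID']).ffill().fillna(0).astype(int)
streak = (b - last_gap).add_prefix('X.')
print(df.join(streak))
```

For A the running total is 1, 2, 2, 3 and the total at the last gap is 0, 0, 2, 2, so the streak comes out as 1, 2, 0, 1: the count restarts after the row where A is missing.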
First add the indicator columns, then use groupby to compute a cumulative sum:
import numpy as np
# find which columns to add
cols = set([l for sublist in df['Input'].apply(lambda x: x.split(',')).values for l in sublist])
# add the new columns
for col in cols:
    df['X.' + col] = df['Input'].apply(lambda x: int(col in x))
# group by and add the cumulative sum, keeping it only where the value is present
# (transform keeps the original row index, so the result aligns with df)
group = df.groupby('ID')
for col in cols:
    df['X.' + col] = group['X.' + col].transform(lambda x: np.cumsum(x) * (x > 0).astype(int))
The result is then:
print(df)
ID Date Input X.C X.D X.A X.B
0 1 1-NOV A,B 0 0 1 1
1 1 2-NOV A 0 0 2 0
2 2 3-NOV A,B,C 1 0 1 1
3 2 4-NOV B,D 0 1 0 2

Creating a dataframe with binary valued columns with pandas using values from an existing dataframe

I am trying to create a new dataframe with binary (0 or 1) values from an existing dataframe. For every row in the given dataframe, the program should take the value from each cell and set 1 in the corresponding column of the row with the same index in the new dataframe.
I have tried executing the following code snippet.
for col in products :
    index = 0;
    for item in products.loc[col] :
        products_coded.ix[index, 'prod_' + str(item)] = 1;
        index = index + 1;
It works for a small number of rows, but it takes a lot of time on any large dataset. What would be the best way to get the desired outcome?
I think you need:
first get_dummies, casting the values to strings
aggregate the duplicated column names with max
convert the column names to int for correct ordering
reindex to order the columns and add the missing ones, replacing NaNs with 0 via the parameter fill_value=0 and dropping the 0 column
add_prefix to rename the columns
df = pd.DataFrame({'B':[3,1,12,12,8],
'C':[0,6,0,14,0],
'D':[0,14,0,0,0]})
print (df)
B C D
0 3 0 0
1 1 6 14
2 12 0 0
3 12 14 0
4 8 0 0
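df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T  # max per duplicated column name; replaces .max(level=0, axis=1), removed in pandas 2.0
         .rename(columns=lambda x: int(x))
         .reindex(columns=range(1, df.values.max() + 1), fill_value=0)
         .add_prefix('prod_'))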
print (df1)
prod_1 prod_2 prod_3 prod_4 prod_5 prod_6 prod_7 prod_8 prod_9 \
0 0 0 1 0 0 0 0 0 0
1 1 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0
prod_10 prod_11 prod_12 prod_13 prod_14
0 0 0 0 0 0
1 0 0 0 0 1
2 0 0 1 0 0
3 0 0 1 0 1
4 0 0 0 0 0
Another similar solution:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='', dtype=int)
         .T.groupby(level=0).max().T)  # replaces .max(level=0, axis=1), removed in pandas 2.0
df1.columns = df1.columns.astype(int)
df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
          .add_prefix('prod_'))
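For comparison, the same encoding can be sketched with plain NumPy indexing, which avoids building and regrouping the dummy columns. This is not the answerer's method; it assumes every cell is a non-negative integer product id and that 0 means "no product":

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 12, 12, 8],
                   'C': [0, 6, 0, 14, 0],
                   'D': [0, 14, 0, 0, 0]})

values = df.to_numpy()
n_products = int(values.max())
coded = np.zeros((len(df), n_products + 1), dtype=int)
# Row index repeated once per source column, matching values.ravel() (row-major order)
rows = np.repeat(np.arange(len(df)), df.shape[1])
coded[rows, values.ravel()] = 1
# Column 0 only marks "no product", so drop it
df1 = pd.DataFrame(coded[:, 1:], index=df.index,
                   columns=[f'prod_{i}' for i in range(1, n_products + 1)])
print(df1)
```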

Quickest way to make a get_dummies type dataframe from a column with a multiple of strings

I have a column, 'col2', that holds comma-separated strings. The current code I have is too slow: there are about 2000 unique strings (the letters in the example below) and 4000 rows, ending up as 2000 columns and 4000 rows.
In [268]: df.head()
Out[268]:
col1 col2
0 6 A,B
1 15 C,G,A
2 25 B
Is there a fast way to turn this into a get_dummies format, where each string has its own column and each of those columns holds a 0 or 1 depending on whether that row has that string in col2?
def get_list(df):
    d = []
    for row in df.col2:
        row_list = row.split(',')
        for string in row_list:
            if string not in d:
                d.append(string)
    return d
df_list = get_list(df)
def make_cols(df, lst):
    for string in lst:
        df[string] = 0
    return df
df = make_cols(df, df_list)
for idx in range(0, len(df['col2'])):
    row_list = df['col2'].iloc[idx].split(',')
    for string in row_list:
        df[string].iloc[idx]+= 1
Out[113]:
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
This is my current code for it, but it's too slow.
Thanks for any help!
You can use:
>>> df['col2'].str.get_dummies(sep=',')
A B C G
0 1 1 0 0
1 1 0 1 1
2 0 1 0 0
To join the Dataframes:
>>> pd.concat([df, df['col2'].str.get_dummies(sep=',')], axis=1)
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
