How to separate elements in a pandas DataFrame - Python

I am trying to handle the following dataframe:
import pandas as pd
df = pd.DataFrame(
    data={'m1': [0,0,1,0,0,0,0,0,0,0,0],
          'm2': [0,0,0,0,0,1,0,0,0,0,0],
          'm3': [0,0,0,0,0,0,0,0,1,0,0],
          'm4': [0,1,0,0,0,0,0,0,0,0,0],
          'm5': [0,0,0,0,0,0,0,0,0,0,0],
          'm6': [0,0,0,0,0,0,0,0,0,1,0]}
)
df
# output:
m1 m2 m3 m4 m5 m6
0 0 0 0 0 0 0
1 0 0 0 1 0 0
2 1 0 0 0 0 0
3 0 0 0 0 0 0
4 0 0 0 0 0 0
5 0 1 0 0 0 0
6 0 0 0 0 0 0
7 0 0 0 0 0 0
8 0 0 1 0 0 0
9 0 0 0 0 0 1
10 0 0 0 0 0 0
From the above dataframe, I want to separate m1 from the other features: assign 1 to m_other if any of m2 through m6 is 1.
Ideal results are shown below.
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0
I thought about using the any function, but I couldn't figure out how to apply it here.
If anyone has a good idea, I would appreciate it if you could share it.

Use DataFrame.any or DataFrame.max on the non-m1 columns and join the result back to the m1 column:
# select all columns except the first
df1 = df[['m1']].assign(m_other=df.iloc[:, 1:].max(axis=1))
df1 = df[['m1']].assign(m_other=df.iloc[:, 1:].any(axis=1).astype(int))
# select all columns except m1
df1 = df[['m1']].assign(m_other=df.drop(columns='m1').max(axis=1))
# select the columns between m2 and m6
df1 = df[['m1']].assign(m_other=df.loc[:, 'm2':'m6'].max(axis=1))
print(df1)
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0
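If the non-m1 columns share a naming pattern, DataFrame.filter is a compact alternative (a sketch assuming the columns really are named m2 through m6):
# select the "other" columns by name pattern
df1 = df[['m1']].assign(m_other=df.filter(regex='^m[2-6]$').any(axis=1).astype(int))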

Does this work?
pd.concat([df['m1'], df.iloc[:, 1:].apply(lambda x: int(x.any()), axis=1)], axis=1, keys=['m1', 'm_other'])
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0

A simpler way is to separate it into two dataframes and then recombine them.
# data is the dataframe for the result
data = pd.DataFrame(columns=['m1', 'm_other'])
# there's no change in m1, so we assign it directly
data.m1 = df.m1
# we create a dataframe for the other columns
data_other = df[['m2', 'm3', 'm4', 'm5', 'm6']]
# we assign True if any of m2 to m6 has the value 1
data.m_other = [any(data_other.iloc[i] == 1) for i in range(len(df))]
# we map it to 1 and 0 instead of True and False
data.m_other = data.m_other.astype(int)
# this is our final result
data

Here is one way to do it, using concat to combine the first column with the max of the remaining columns, and then renaming the new column:
df2 = pd.concat([df.iloc[:, :1], df.iloc[:, 1:].max(axis=1)], axis=1)
df2 = df2.rename(columns={0: 'm_other'})
df2
m1 m_other
0 0 0
1 0 1
2 1 0
3 0 0
4 0 0
5 0 1
6 0 0
7 0 0
8 0 1
9 0 1
10 0 0

Related

Convert one-hot encoded data-frame columns into one column

In a pandas data frame, the one-hot encoded vectors are present as columns, i.e.:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How can I convert these columns into one data frame column by label encoding them in Python? i.e.:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I also need a suggestion on handling rows that have multiple 1s, since we can have only one category at a time.
Try with argmax:
# df = df.set_index('Rows')
df['New'] = df.values.argmax(1) + 1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
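If you want the column label itself rather than a 1-based code (an assumption about the desired output), idxmax returns the name of the first column holding the row maximum:
# drop the just-added 'New' column so it does not win the row maximum
df['Label'] = df.drop(columns='New').idxmax(axis=1)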
argmax is the way to go; adding another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
I also need a suggestion on handling rows that have multiple 1s, since we can have only one category at a time.
In this case you dot your DataFrame of dummies with an array of all the powers of 2 (based on the number of columns). This ensures that the presence of any unique combination of dummies (A, A+B, A+B+C, B+C, ...) will have a unique category label. (Added a few rows at the bottom to illustrate the unique counting)
import numpy as np
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
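To go back from a category code to the dummy columns it encodes, you can read off the binary digits of the code; decode_category below is a hypothetical helper, not part of the answer above:
import numpy as np
def decode_category(code, columns):
    # bit i of the code corresponds to column i
    mask = (code >> np.arange(len(columns))) & 1
    return [c for c, m in zip(columns, mask) if m]
decode_category(19, list('ABCDE'))  # ['A', 'B', 'E']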
Another readable solution, on top of the other great solutions already provided, that works for ANY type of variables in your dataframe:
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
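Note that this relies on there being exactly one 1 per row. np.nonzero is the same operation under a more explicit name, if you prefer it (purely a stylistic alternative):
df['variables'] = np.nonzero(df.values)[1] + 1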

Python reset cumulative sum over intervals in a column

I am trying to do a cumulative sum by intervals, i.e. with the cumsum being reset to zero whenever the next value to accumulate is 0. Below is an example, followed by the desired result. I have tried using numpy convolve and groupby but can't come up with a way to do the reset except by writing a def that loops over all the rows. Is there a clever approach I'm missing? Note that in the real data, column 'x' holds real numbers separated by 0s.
import numpy as np
import pandas as pd
a = pd.DataFrame([[0,0],[1,0],[1,0],[1,0],[0,0],[0,0],[0,0],[0,0],[0,0],[0,0],
                  [0,0],[0,0],[0,0],[0,0],[1,0],[1,0],[0,0]], columns=["x","y"])

def patch(k):
    k["z"] = k.x.cumsum()
    return k

print(patch(a))
Current output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 3
5 0 0 3
6 0 0 3
7 0 0 3
8 0 0 3
9 0 0 3
10 0 0 3
11 0 0 3
12 0 0 3
13 0 0 3
14 1 0 4
15 1 0 5
16 0 0 5
Desired output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 1 0 1
15 1 0 2
16 0 0 0
Do groupby on cumsum:
a['z'] = a.groupby(a['x'].eq(0).cumsum())['x'].cumsum()
Output:
x y z
0 0 0 0
1 1 0 1
2 1 0 2
3 1 0 3
4 0 0 0
5 0 0 0
6 0 0 0
7 0 0 0
8 0 0 0
9 0 0 0
10 0 0 0
11 0 0 0
12 0 0 0
13 0 0 0
14 1 0 1
15 1 0 2
16 0 0 0
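To see why this works, inspect the grouper: x.eq(0).cumsum() increments at every zero, so each run of 1s (together with the zero that precedes it) gets its own group id, and the cumsum restarts within each group. A quick way to check the intermediate values:
grouper = a['x'].eq(0).cumsum()
print(pd.concat([a['x'], grouper.rename('group')], axis=1))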

Accumulate 1 and Reset to 0 once condition is met

Currently I have the dataset below, and I am trying to accumulate a count while ColA is 0 and reset it to 0 (restart the counting) whenever ColA is 1 again.
ColA
1
0
1
1
0
1
0
0
0
1
0
0
0
My expected result is as below.
ColA Accumulate
1 0
0 1
1 0
1 0
0 1
1 0
0 1
0 2
0 3
1 0
0 1
0 2
0 3
The current code I use:
test['Value'] = np.where(test['ColA']==1, 0, test['ColA'].eq(0).cumsum())
ColA Value
1 0
0 1
1 0
1 0
0 2
1 0
0 3
0 4
0 5
1 0
0 6
0 7
0 8
Use cumsum if performance is important:
a = df['ColA'] == 0
cumsumed = a.cumsum()
df['Accumulate'] = cumsumed-cumsumed.where(~a).ffill().fillna(0).astype(int)
print (df)
ColA Accumulate
0 1 0
1 0 1
2 1 0
3 1 0
4 0 1
5 1 0
6 0 1
7 0 2
8 0 3
9 1 0
10 0 1
11 0 2
12 0 3
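The same logic spelled out step by step, with illustrative intermediate names:
a = df['ColA'] == 0                                 # rows that should accumulate
cumsumed = a.cumsum()                               # running count of zeros seen so far
last_reset = cumsumed.where(~a).ffill().fillna(0)   # running count frozen at each reset row
df['Accumulate'] = (cumsumed - last_reset).astype(int)  # zeros since the last reset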
This should do it:
test['Value'] = (test['ColA']==0) * 1 * (test['ColA'].groupby((test['ColA'] != test['ColA'].shift()).cumsum()).cumcount() + 1)
It is an adaptation of this answer.
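Broken into steps for readability (the intermediate names are illustrative):
is_zero = test['ColA'] == 0                              # rows that count up
block = (test['ColA'] != test['ColA'].shift()).cumsum()  # id of each consecutive run
run_pos = test['ColA'].groupby(block).cumcount() + 1     # 1-based position within the run
test['Value'] = is_zero * 1 * run_pos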

Pandas One hot encoding: Bundling together less frequent categories

I'm doing one hot encoding over a categorical column which has some 18 different kind of values. I want to create new columns for only those values, which appear more than some threshold (let's say 1%), and create another column named other values which has 1 if value is other than those frequent values.
I'm using Pandas with Sci-kit learn. I've explored pandas get_dummies and sci-kit learn's one hot encoder, but can't figure out how to bundle together less frequent values into one column.
Plan:
pd.get_dummies to one-hot encode as normal
sum() < threshold to identify the columns that get aggregated
pd.value_counts with the parameter normalize=True to get the percentage of occurrence
join the aggregated 'other' column back
def hot_mess2(s, thresh):
    d = pd.get_dummies(s)
    f = pd.value_counts(s, sort=False, normalize=True) < thresh
    if f.sum() == 0:
        return d
    else:
        return d.loc[:, ~f].join(d.loc[:, f].sum(1).rename('other'))
Consider the pd.Series s:
import numpy as np
s = pd.Series(np.repeat(list('abcdef'), range(1, 7)))
s
0 a
1 b
2 b
3 c
4 c
5 c
6 d
7 d
8 d
9 d
10 e
11 e
12 e
13 e
14 e
15 f
16 f
17 f
18 f
19 f
20 f
dtype: object
hot_mess2(s, 0)
a b c d e f
0 1 0 0 0 0 0
1 0 1 0 0 0 0
2 0 1 0 0 0 0
3 0 0 1 0 0 0
4 0 0 1 0 0 0
5 0 0 1 0 0 0
6 0 0 0 1 0 0
7 0 0 0 1 0 0
8 0 0 0 1 0 0
9 0 0 0 1 0 0
10 0 0 0 0 1 0
11 0 0 0 0 1 0
12 0 0 0 0 1 0
13 0 0 0 0 1 0
14 0 0 0 0 1 0
15 0 0 0 0 0 1
16 0 0 0 0 0 1
17 0 0 0 0 0 1
18 0 0 0 0 0 1
19 0 0 0 0 0 1
20 0 0 0 0 0 1
hot_mess2(s, .1)
c d e f other
0 0 0 0 0 1
1 0 0 0 0 1
2 0 0 0 0 1
3 1 0 0 0 0
4 1 0 0 0 0
5 1 0 0 0 0
6 0 1 0 0 0
7 0 1 0 0 0
8 0 1 0 0 0
9 0 1 0 0 0
10 0 0 1 0 0
11 0 0 1 0 0
12 0 0 1 0 0
13 0 0 1 0 0
14 0 0 1 0 0
15 0 0 0 1 0
16 0 0 0 1 0
17 0 0 0 1 0
18 0 0 0 1 0
19 0 0 0 1 0
20 0 0 0 1 0
How about something like the following:
Create a data frame:
df = pd.DataFrame(data=list('abbgcca'), columns=['x'])
df
x
0 a
1 b
2 b
3 g
4 c
5 c
6 a
Replace values that are present less frequently than a given threshold. I'll create a copy of the column so that I'm not modifying the original dataframe. The first step is to create a dictionary of the value_counts and then replace the actual values with those counts so that they can be compared to the threshold. Set values below that threshold to 'other values', then use pd.get_dummies to get the dummy variables:
#set the threshold for example 20%
thresh = 0.2
x = df.x.copy()
#replace any values present less than the threshold with 'other values'
x[x.replace(x.value_counts().to_dict()) < len(x)*thresh] = 'other values'
#get dummies
pd.get_dummies(x)
a b c other values
0 1.0 0.0 0.0 0.0
1 0.0 1.0 0.0 0.0
2 0.0 1.0 0.0 0.0
3 0.0 0.0 0.0 1.0
4 0.0 0.0 1.0 0.0
5 0.0 0.0 1.0 0.0
6 1.0 0.0 0.0 0.0
Alternatively, you could use Counter; it may be a bit cleaner:
from collections import Counter
x[x.replace(Counter(x)) < len(x)*thresh] = 'other values'
R has a good function fct_lump for this purpose, and it has been ported to Python in the siuba package: you simply select the number of levels to keep, and all the other levels are bundled as 'Other'.
pip install siuba
# (in a Python or Anaconda prompt shell)
# use the library as:
from siuba.dply.forcats import fct_lump, fct_reorder
# just like R's fct_lump:
df['Your_column'] = fct_lump(df['Your_column'], n=10)
df['Your_column'].value_counts()  # check your levels
# it reduces the levels to 10, lumping all the others as 'Other'
An improved version:
The previous solutions do not scale well when the dataframe is large. The situation also becomes complicated when you want to perform one-hot encoding on one column only and your original dataframe has more than one column. Here is a more general and scalable (faster) solution, illustrated with an example df with two columns and 1 million rows:
import random
import string
import pandas as pd
df = pd.DataFrame(
    {'1st': [random.sample(["orange", "apple", "banana"], k=1)[0] for i in range(1000000)],
     '2nd': [random.sample(list(string.ascii_lowercase), k=1)[0] for i in range(1000000)]}
)
The first 10 rows df.head(10) is:
1st 2nd
0 banana t
1 orange t
2 banana m
3 banana g
4 banana g
5 orange a
6 apple x
7 orange s
8 orange d
9 apple u
The statistics from df['2nd'].value_counts() are:
s 39004
k 38726
n 38720
b 38699
t 38688
p 38646
u 38638
w 38611
y 38587
o 38576
q 38559
x 38558
r 38545
i 38497
h 38429
v 38385
m 38369
j 38278
f 38262
e 38241
a 38241
l 38236
g 38210
z 38202
c 38058
d 38035
Step 1: Define threshold
threshold = 38500
Step 2: Focus on the column(s) you want to one-hot encode, and change the entries whose frequency is lower than the threshold to 'others':
%timeit df.loc[df['2nd'].value_counts()[df['2nd']].values < threshold, '2nd'] = "others"
Time taken is 206 ms ± 346 µs per loop (mean ± std. dev. of 7 runs, 1 loop each).
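The boolean mask in Step 2 works by mapping each row's category to that category's frequency via a label lookup; step by step, with illustrative names:
freq = df['2nd'].value_counts()      # frequency of each category
row_freq = freq[df['2nd']].values    # per-row frequency, looked up by label
rare = row_freq < threshold          # rows whose category is below the threshold
df.loc[rare, '2nd'] = 'others'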
Step 3: Apply one-hot encoding as usual
df = pd.get_dummies(df, columns = ['2nd'], prefix='', prefix_sep='')
The first 10 rows after one-hot encoding df.head(10) becomes
1st b k n o others p q r s t u w x y
0 banana 0 0 0 0 0 0 0 0 0 1 0 0 0 0
1 orange 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2 banana 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3 banana 0 0 0 0 1 0 0 0 0 0 0 0 0 0
4 banana 0 0 0 0 1 0 0 0 0 0 0 0 0 0
5 orange 0 0 0 0 1 0 0 0 0 0 0 0 0 0
6 apple 0 0 0 0 0 0 0 0 0 0 0 0 1 0
7 orange 0 0 0 0 0 0 0 0 1 0 0 0 0 0
8 orange 0 0 0 0 1 0 0 0 0 0 0 0 0 0
9 apple 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Step 4 (optional): If you want others to be the last column of the df, you can try:
df = df[[col for col in df.columns if col != 'others'] + ['others']]
This shifts others to the last column.
1st b k n o p q r s t u w x y others
0 banana 0 0 0 0 0 0 0 0 1 0 0 0 0 0
1 orange 0 0 0 0 0 0 0 0 1 0 0 0 0 0
2 banana 0 0 0 0 0 0 0 0 0 0 0 0 0 1
3 banana 0 0 0 0 0 0 0 0 0 0 0 0 0 1
4 banana 0 0 0 0 0 0 0 0 0 0 0 0 0 1
5 orange 0 0 0 0 0 0 0 0 0 0 0 0 0 1
6 apple 0 0 0 0 0 0 0 0 0 0 0 1 0 0
7 orange 0 0 0 0 0 0 0 1 0 0 0 0 0 0
8 orange 0 0 0 0 0 0 0 0 0 0 0 0 0 1
9 apple 0 0 0 0 0 0 0 0 0 1 0 0 0 0
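An equivalent one-liner for Step 4, using pop (just a stylistic alternative): pop removes the column, and reassigning it appends it at the end.
df['others'] = df.pop('others')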

Pivotting via Python and Pandas

I have a table like this:
ID Word
1 take
2 the
3 long
4 long
5 road
6 and
7 walk
8 it
9 walk
10 it
I want to use a pivot table in pandas to get the distinct words as columns and 1s and 0s as values. Something like this matrix:
ID Take The Long Road And Walk It
1 1 0 0 0 0 0 0
2 0 1 0 0 0 0 0
3 0 0 1 0 0 0 0
4 0 0 1 0 0 0 0
5 0 0 0 1 0 0 0
and so on
I am trying to use pivot_table but am not familiar with the pandas syntax yet:
import pandas as pd
data = pd.read_csv('dataset.txt', sep='|', encoding='latin1')
table = pd.pivot_table(data,index=["ID"],columns=pd.unique(data["Word"].values),fill_value=0)
How can I rewrite the pivot_table call to handle this?
You can use concat with str.get_dummies:
print(pd.concat([df['ID'], df['Word'].str.get_dummies()], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
Or, as Edchum mentioned in the comments, pd.get_dummies:
print(pd.concat([df['ID'], pd.get_dummies(df['Word'])], axis=1))
ID and it long road take the walk
0 1 0 0 0 0 1 0 0
1 2 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0
3 4 0 0 1 0 0 0 0
4 5 0 0 0 1 0 0 0
5 6 1 0 0 0 0 0 0
6 7 0 0 0 0 0 0 1
7 8 0 1 0 0 0 0 0
8 9 0 0 0 0 0 0 1
9 10 0 1 0 0 0 0 0
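Since the question asked about pivoting specifically, pd.crosstab produces the same 0/1 matrix directly; this is a sketch, not from the answers above, and it leaves ID as the index rather than a column:
table = pd.crosstab(df['ID'], df['Word'])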
