In my pandas DataFrame, the one-hot encoded vectors are stored as columns, e.g.:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How can I convert these columns into a single DataFrame column by label encoding them in Python? For example:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I would also like a suggestion for rows that contain multiple 1s: how should those be handled, given that a row can have only one category at a time?
Try argmax:
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
4 0 0 0 0 1 5
argmax is the way to go; here is another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
Also need suggestion on this that some rows have multiple 1s, how to
handle those rows because we can have only one category at a time.
In this case, take the dot product of your DataFrame of dummies with an array of the powers of 2 (one power per column). This ensures that every distinct combination of dummies (A, A+B, A+B+C, B+C, ...) gets a unique category label. (I added a few rows at the bottom to illustrate the unique labels.)
import numpy as np
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
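As a minimal, self-contained sketch of the powers-of-2 idea above: the sample frame and the `multi_hot` column name are assumptions made for illustration, not part of the original answer. It also flags rows with more than one 1, so they can be reviewed separately if only one category per row is allowed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: rows 0-2 are one-hot, rows 3-4 contain several 1s.
df = pd.DataFrame(
    [[0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [1, 0, 0, 0, 0],
     [1, 0, 0, 0, 1],
     [1, 1, 0, 0, 1]],
    columns=list("ABCDE"),
)
dummies = df[list("ABCDE")]  # copy of the indicator columns only

# Weight column i by 2**i so every combination of 1s gets its own label.
weights = 2 ** np.arange(dummies.shape[1])
df["Category"] = dummies.dot(weights)

# Flag rows that are not strictly one-hot (assumed column name).
df["multi_hot"] = dummies.sum(axis=1) > 1
print(df)
```

Keeping the indicator columns in a separate `dummies` frame avoids accidentally including `Category` in a later dot product.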
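Another readable solution, on top of the other great solutions provided, that works for any type of variable in your DataFrame, provided each row has exactly one nonzero entry (otherwise the result will not have one value per row):
import numpy as np
df['variables'] = np.where(df.values)[1]+1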
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
My apologies, SO community. I am new to the platform, and in trying to make this question precise and to the point, I left out relevant information.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
'alpha': ['A','B,C,D,A','B,C,A'],
'beta': ['1|20|30','350','376|98']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
Looks like this,
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376
I want something like this,
user_id alpha beta a_A a_B a_C a_D b_1 b_20 b_30 b_350 b_376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
My original data contains 11K rows, and there are around 550 distinct values in alpha and beta.
I created a list of all the values in the alpha and beta columns and applied pd.get_dummies, but that produces many rows, like the output displayed by #wwwnde. I would like all the rows to be rolled up by user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns from all the words in a sentence and counts the frequency of each word. However, I am guessing pandas has a better, more efficient way to do that.
Grateful for all your assistance. :)
Desired Output
You will have to achieve that in a series of steps.
Sample Data
id ALPHA BETA
0 1 A 1|20|30
1 2 B,C,D,A 350
2 3 B,C,A 395|45|90
Create lists from the values in ALPHA and BETA
df.BETA = df.BETA.str.split('|')
df.ALPHA = df.ALPHA.str.split(',')
Explode the list elements into individual rows
df = df.explode('ALPHA')
df = df.explode('BETA')
Extract the indicator variables using get_dummies (assign the result, otherwise it is discarded)
df = pd.get_dummies(df, dtype=int)
Strip the prefix from the column names (the pattern is a regular expression)
df.columns = df.columns.str.replace('ALPHA_|BETA_', '', regex=True)
id A B C D 1 20 30 350 395 45 90
0 1 1 0 0 0 1 0 0 0 0 0 0
0 1 1 0 0 0 0 1 0 0 0 0 0
0 1 1 0 0 0 0 0 1 0 0 0 0
1 2 0 1 0 0 0 0 0 1 0 0 0
1 2 0 0 1 0 0 0 0 1 0 0 0
1 2 0 0 0 1 0 0 0 1 0 0 0
1 2 1 0 0 0 0 0 0 1 0 0 0
2 3 0 1 0 0 0 0 0 0 1 0 0
2 3 0 1 0 0 0 0 0 0 0 1 0
2 3 0 1 0 0 0 0 0 0 0 0 1
2 3 0 0 1 0 0 0 0 0 1 0 0
2 3 0 0 1 0 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0 0 0 0 1
2 3 1 0 0 0 0 0 0 0 1 0 0
2 3 1 0 0 0 0 0 0 0 0 1 0
2 3 1 0 0 0 0 0 0 0 0 0 1
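The exploded output above still has several rows per id, while the question asks for one row per user. One possible way to roll it up, sketched here under the assumption that a value counts as present when any exploded row has a 1, is to group by id and take the maximum. The `a`/`b` prefixes and the variable names are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3],
    "ALPHA": ["A", "B,C,D,A", "B,C,A"],
    "BETA": ["1|20|30", "350", "395|45|90"],
})

# Split the delimited strings into lists, then give each element its own row.
df["ALPHA"] = df["ALPHA"].str.split(",")
df["BETA"] = df["BETA"].str.split("|")
long_df = df.explode("ALPHA").explode("BETA")

# Indicator columns for every value; the numeric 'id' column passes through.
dummies = pd.get_dummies(
    long_df, columns=["ALPHA", "BETA"], prefix=["a", "b"], dtype=int
)

# Collapse the exploded rows: a value is present if any row for that id has a 1.
rolled = dummies.groupby("id").max()
print(rolled)
```

Using `max` keeps the result binary; `sum` would instead count how many exploded rows mention each value, which is inflated by the cross product of the two explodes.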
Here is an example.
a b k c
0 0 0 0
0 1 1 0
0 2 0 0
0 3 0 0
0 4 1 0
0 5 0 0
0 0 0 1
0 1 1 1
0 2 0 1
0 3 0 1
0 4 1 1
0 5 0 1
1 0 0 0
1 1 1 0
1 2 0 0
1 3 1 0
1 4 0 0
1 0 0 1
1 1 1 1
1 2 0 1
1 3 1 1
1 4 0 1
Here, "a" is the user id, "b" is the time, "c" is the product, and "k" is a binary indicator flag. For each c, the values of "b" are guaranteed to be consecutive, and the flag "k" is the same for every row with a given (a, b) pair, which means it does not depend on "c". What I want to get is this:
a b k c diff_b
0 0 0 0 nan
0 1 1 0 nan
0 2 0 0 1
0 3 0 0 2
0 4 1 0 3
0 5 0 0 1
0 0 0 1 nan
0 1 1 1 nan
0 2 0 1 1
0 3 0 1 2
0 4 1 1 3
0 5 0 1 1
1 0 0 0 nan
1 1 1 0 nan
1 2 0 0 1
1 3 1 0 2
1 4 0 0 1
1 0 0 1 nan
1 1 1 1 nan
1 2 0 1 1
1 3 1 1 2
1 4 0 1 1
So, diff_b is a time-difference variable. It shows the duration between the current time point and the last earlier time point with an action. If there has been no action before, it is nan. diff_b is computed per user (a), and for the same user it is also computed independently for each product (c).
Thank you.
You just need to add c to the grouping keys in the second step:
df['New'] = df.b.loc[df.k == 1]  # value of b on rows where k equals 1, NaN elsewhere
# forward-fill within each (a, c) group, then shift so a row never sees its own action
df['New'] = df.groupby(['a', 'c'])['New'].transform(lambda x: x.ffill().shift())
df['diff_b'] = df.b - df['New']
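For reference, here is a self-contained sketch of the same idea on a reduced sample (one user, two products). The sample values and the names `last_action` and `prev_action` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Small hypothetical frame: one user, two products, same flags per time point.
df = pd.DataFrame({
    "a": [0] * 12,
    "b": [0, 1, 2, 3, 4, 5] * 2,
    "k": [0, 1, 0, 0, 1, 0] * 2,
    "c": [0] * 6 + [1] * 6,
})

# Time of the action on rows where k == 1, NaN elsewhere.
last_action = df["b"].where(df["k"] == 1)

# Within each (user, product) group: carry the last action time forward,
# then shift by one row so a row never counts its own action.
prev_action = last_action.groupby([df["a"], df["c"]]).transform(
    lambda s: s.ffill().shift()
)

df["diff_b"] = df["b"] - prev_action
print(df)
```

The shift is what makes row b=4 report 3 (time since the action at b=1) rather than 0.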
Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is how to perform the "reverse pd.get_dummies()"?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is this: given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how could I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
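One caveat worth sketching: on a row with no 1 at all, idxmax still returns the first column label. A possible guard, assuming such rows should become missing values, masks them with `where`. The sample frame below is hypothetical.

```python
import pandas as pd

# Hypothetical frame: the last row has no 1 at all.
df = pd.DataFrame(
    [[1, 0, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]],
    columns=list("ABCD"),
)
dummies = df[list("ABCD")]

# idxmax alone would label the all-zero row "A"; mask it out instead.
df["category"] = dummies.idxmax(axis=1).where(dummies.any(axis=1))
print(df)
```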
I have a dataframe like the following:
Labels
1 Nail_Polish,Nails
2 Nail_Polish,Nails
3 Foot_Care,Targeted_Body_Care
4 Foot_Care,Targeted_Body_Care,Skin_Care
I want to generate the following matrix:
Nail_Polish Nails Foot_Care Targeted_Body_Care Skin_Care
1 1 1 0 0 0
2 1 1 0 0 0
3 0 0 1 1 0
4 0 0 1 1 1
How can I achieve this?
Use str.get_dummies:
df2 = df['Labels'].str.get_dummies(sep=',')
The resulting output:
Foot_Care Nail_Polish Nails Skin_Care Targeted_Body_Care
1 0 1 1 0 0
2 0 1 1 0 0
3 1 0 0 0 1
4 1 0 0 1 1
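Note that str.get_dummies returns the columns in sorted order, whereas the matrix in the question lists labels in order of first appearance. If that order matters, a small sketch like the following reorders the columns. The frame is rebuilt from the question's data, and the `order` variable is an illustrative name.

```python
import pandas as pd

df = pd.DataFrame(
    {"Labels": ["Nail_Polish,Nails",
                "Nail_Polish,Nails",
                "Foot_Care,Targeted_Body_Care",
                "Foot_Care,Targeted_Body_Care,Skin_Care"]},
    index=[1, 2, 3, 4],
)

# Labels in order of first appearance (unique() preserves encounter order).
order = list(df["Labels"].str.split(",").explode().unique())

# str.get_dummies sorts the columns alphabetically; select them in the wanted order.
df2 = df["Labels"].str.get_dummies(sep=",")[order]
print(df2)
```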
With this code:
from itertools import product
for a, b, c, d in product(range(low, high), repeat=4):
    print(a, b, c, d)
I have an output like this:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 1 0
0 0 1 1
0 0 1 2
0 0 2 0
0 0 2 1
0 0 2 2
But how can I create an algorithm capable of producing this:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 0 3
0 0 0 4
0 0 1 1
0 0 1 2
0 0 1 3
0 0 1 4
0 0 2 2
0 0 2 3
0 0 2 4
0 0 3 3
0 0 3 4
0 0 4 4
More importantly, every column of the output must have a different range, for example: first column 0-4, second column 0-10, etc.
And the number of columns (a, b, c, d) isn't fixed; depending on other parts of the program, it can range from 2 to 200.
UPDATE: to be clearer, what I need is something like this:
for a in range(0, 10):
    for b in range(a, 10):
        for c in range(b, 10):
            for d in range(c, 10):
                print(a, b, c, d)
The question has been partially resolved, but I still have problems changing the range parameters as in the example above.
Excuse me for the mess! :)
itertools.product can already do exactly what you are looking for, simply by passing it multiple iterables (in this case the ranges you want). It will collect one element from each iterable passed. For example:
for a, b, c in product(range(2), range(3), range(4)):
    print(a, b, c)
Outputs:
0 0 0
0 0 1
0 0 2
0 0 3
0 1 0
0 1 1
0 1 2
0 1 3
0 2 0
0 2 1
0 2 2
0 2 3
1 0 0
1 0 1
1 0 2
1 0 3
1 1 0
1 1 1
1 1 2
1 1 3
1 2 0
1 2 1
1 2 2
1 2 3
If your input ranges are variable, just place the loop in a function and call it with different parameters. You can also use something along the lines of
for elements in product(*(range(i) for i in [1, 2, 3, 4])):
    print(*elements)
if you have a large number of input iterables.
With your updated request for the variable ranges, a nice short-circuiting approach with itertools.product is not as clear, although you can always just check that each generated tuple is sorted in ascending order (this is essentially what your variable ranges ensure). As per your example:
for elements in product(*(range(i) for i in [10, 10, 10, 10])):
    if all(elements[i] <= elements[i+1] for i in range(len(elements)-1)):
        print(*elements)
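Filtering the full product still visits every combination, which grows quickly with the number of columns. As an alternative sketch (not from the original answer), a recursive generator can start each column at the value chosen for the previous one, so invalid tuples are never produced. The function name `nondecreasing` and its `limits` parameter (exclusive upper bound per column) are hypothetical.

```python
def nondecreasing(limits, start=0):
    """Yield tuples t with start <= t[0] <= t[1] <= ... and t[i] < limits[i]."""
    if not limits:
        yield ()
        return
    for value in range(start, limits[0]):
        # Later columns may not go below the value chosen here.
        for rest in nondecreasing(limits[1:], value):
            yield (value,) + rest


for elements in nondecreasing([3, 4, 5]):
    print(*elements)
```

When every column shares the same range, itertools.combinations_with_replacement(range(n), r) yields the same non-decreasing tuples directly. The recursion depth here equals the number of columns, so the 200-column case stays within Python's default recursion limit.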
Are you looking for something like this?
# the program would modify these variables below
column1_max = 2
column2_max = 3
column3_max = 4
column4_max = 5
# now generate the list
for a in range(column1_max + 1):
    for b in range(column2_max + 1):
        for c in range(column3_max + 1):
            for d in range(column4_max + 1):
                # keep only tuples whose values never decrease from left to right
                if a <= b <= c <= d:
                    print(a, b, c, d)
Output:
0 0 0 0
0 0 0 1
0 0 0 2
0 0 0 3
0 0 0 4
0 0 0 5
0 0 1 1
0 0 1 2
0 0 1 3
0 0 1 4
0 0 1 5
0 0 2 2
0 0 2 3
0 0 2 4
0 0 2 5
0 0 3 3
0 0 3 4
0 0 3 5
0 0 4 4
0 0 4 5
0 1 1 1
0 1 1 2
...