How to turn cell values into new columns in a pandas DataFrame - python

I have a dataframe like the following:
Labels
1 Nail_Polish,Nails
2 Nail_Polish,Nails
3 Foot_Care,Targeted_Body_Care
4 Foot_Care,Targeted_Body_Care,Skin_Care
I want to generate the following matrix:
Nail_Polish Nails Foot_Care Targeted_Body_Care Skin_Care
1 1 1 0 0 0
2 1 1 0 0 0
3 0 0 1 1 0
4 0 0 1 1 1
How can I achieve this?

Use str.get_dummies:
df2 = df['Labels'].str.get_dummies(sep=',')
The resulting output:
Foot_Care Nail_Polish Nails Skin_Care Targeted_Body_Care
1 0 1 1 0 0
2 0 1 1 0 0
3 1 0 0 0 1
4 1 0 0 1 1
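The columns in the output above come back in alphabetical order rather than the order shown in the question. A minimal sketch of restoring the requested order, assuming the sample frame from the question (rebuilt by hand here) and a hypothetical `wanted` list that spells out the desired column order:

```python
import pandas as pd

# Sample data rebuilt from the question; the index 1..4 mirrors the row labels shown there.
df = pd.DataFrame(
    {
        "Labels": [
            "Nail_Polish,Nails",
            "Nail_Polish,Nails",
            "Foot_Care,Targeted_Body_Care",
            "Foot_Care,Targeted_Body_Care,Skin_Care",
        ]
    },
    index=[1, 2, 3, 4],
)

# One indicator column per comma-separated label.
df2 = df["Labels"].str.get_dummies(sep=",")

# Hypothetical list of the column order requested in the question.
wanted = ["Nail_Polish", "Nails", "Foot_Care", "Targeted_Body_Care", "Skin_Care"]
df2 = df2[wanted]
print(df2)
```

Selecting with a list of column names only reorders the columns; the indicator values themselves are unchanged.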

Related

Pandas: sort according to a row

I have a Dataframe like this (with labels on rows and columns):
0 1 2 3
0 1 1 0 0
1 0 1 1 0
2 1 0 1 0
-1 5 6 3 2
I would like to order the columns according to the last row (and then drop the row):
0 1 2 3
0 1 1 0 0
1 1 0 1 0
2 0 1 1 0
Try np.argsort on the negated last row to get the descending column order, then use iloc to rearrange the columns and drop that row:
df.iloc[:-1, np.argsort(-df.iloc[-1])]
Output:
1 0 2 3
0 1 1 0 0
1 1 0 1 0
2 0 1 1 0
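For reference, the same idea as a self-contained sketch on the question's sample frame. Converting the last row to a NumPy array before sorting is a choice made here so that iloc receives plain integer positions; `order` is a name introduced for illustration:

```python
import numpy as np
import pandas as pd

# Sample frame from the question; the last row (label -1) holds the sort keys.
df = pd.DataFrame(
    [[1, 1, 0, 0], [0, 1, 1, 0], [1, 0, 1, 0], [5, 6, 3, 2]],
    index=[0, 1, 2, -1],
    columns=[0, 1, 2, 3],
)

# Negating the keys turns the ascending argsort into a descending order.
order = np.argsort(-df.iloc[-1].to_numpy(), kind="stable")

# Reorder the columns by position and drop the last row in one step.
result = df.iloc[:-1, order]
print(result)
```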

Convert one-hot encoded data-frame columns into one column

In my pandas data frame, the one-hot encoded vectors are present as columns, i.e.:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How can I convert these columns into a single data frame column by label-encoding them in Python? For example:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I also need a suggestion for rows that have multiple 1s: how should those rows be handled, given that we can have only one category at a time?
Try argmax:
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
argmax is the way to go; here is another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
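The two approaches above can be compared side by side. This is a small sketch on the question's data (rebuilt by hand, with a default integer index instead of the Rows column); `by_argmax` and `by_idxmax` are names introduced here for illustration:

```python
import pandas as pd

# One-hot rows from the question, one 1 per row.
df = pd.DataFrame(
    [
        [0, 0, 0, 1, 0],
        [0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0],
        [0, 0, 0, 1, 0],
        [1, 0, 0, 0, 0],
        [0, 0, 0, 0, 1],
    ],
    columns=list("ABCDE"),
)

# Position of the maximum in each row, shifted to 1-based labels.
by_argmax = df.to_numpy().argmax(axis=1) + 1

# Column label of the maximum in each row, mapped back to its position.
by_idxmax = df.columns.get_indexer(df.idxmax(axis=1)) + 1

print(by_argmax.tolist())
print(by_idxmax.tolist())
```

Both variants pick the first maximum in a row, so for rows with several 1s they silently keep only the leftmost one.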
Also need suggestion on this that some rows have multiple 1s, how to
handle those rows because we can have only one category at a time.
In this case you dot your DataFrame of dummies with an array of all the powers of 2 (based on the number of columns). This ensures that the presence of any unique combination of dummies (A, A+B, A+B+C, B+C, ...) will have a unique category label. (Added a few rows at the bottom to illustrate the unique counting)
df['Category'] = df.dot(2**np.arange(df.shape[1]))
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
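A short sketch of the powers-of-2 encoding, together with a way to read a code back into column names. The three sample rows are taken from the table above; `weights`, `codes` and the `decode` helper are hypothetical names added for illustration:

```python
import numpy as np
import pandas as pd

# Three rows from the table above: a single dummy and two combinations.
df = pd.DataFrame(
    [[0, 0, 0, 1, 0], [1, 0, 0, 0, 1], [1, 1, 0, 0, 1]],
    columns=list("ABCDE"),
)

# Column i gets weight 2**i, so every combination sums to a distinct integer.
weights = 2 ** np.arange(df.shape[1])
codes = df.dot(weights)


def decode(code, columns):
    """Return the column names whose bit is set in `code` (illustrative helper)."""
    return [col for i, col in enumerate(columns) if (int(code) >> i) & 1]


print(codes.tolist())
print(decode(codes.iloc[2], df.columns))
```

Because each column owns one bit, the category label can always be decoded back to the original combination.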
Another readable solution, in addition to the other answers, that works for any dtype as long as each row contains exactly one non-zero value:
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
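np.where returns one column position per non-zero cell, so the result lines up with the rows only when every row has exactly one non-zero entry. A small sketch of that check, using a three-row sample made up for illustration (`hits_per_row` and `labels` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Illustrative sample with exactly one non-zero value per row.
df = pd.DataFrame(
    [[0, 0, 0, 1, 0], [0, 0, 1, 0, 0], [1, 0, 0, 0, 0]],
    columns=list("ABCDE"),
)

values = df.to_numpy()

# Count the non-zero cells in each row before trusting np.where.
hits_per_row = (values != 0).sum(axis=1)
if not (hits_per_row == 1).all():
    raise ValueError("each row must contain exactly one non-zero value")

# Column positions of the non-zero cells, shifted to 1-based labels.
labels = np.where(values)[1] + 1
print(labels.tolist())
```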

pandas calculate difference based on indicators grouped by a column with duplicated grouped pair

Here is an example.
a b k c
0 0 0 0
0 1 1 0
0 2 0 0
0 3 0 0
0 4 1 0
0 5 0 0
0 0 0 1
0 1 1 1
0 2 0 1
0 3 0 1
0 4 1 1
0 5 0 1
1 0 0 0
1 1 1 0
1 2 0 0
1 3 1 0
1 4 0 0
1 0 0 1
1 1 1 1
1 2 0 1
1 3 1 1
1 4 0 1
Here, "a" is the user id, "b" is time, "c" is the product, and "k" is a binary indicator flag. For each c, the values of "b" are guaranteed to be consecutive, and the flag "k" is the same for every occurrence of a given (a, b) pair, which means it is independent of "c". What I want to get is this:
a b k c diff_b
0 0 0 0 nan
0 1 1 0 nan
0 2 0 0 1
0 3 0 0 2
0 4 1 0 3
0 5 0 0 1
0 0 0 1 nan
0 1 1 1 nan
0 2 0 1 1
0 3 0 1 2
0 4 1 1 3
0 5 0 1 1
1 0 0 0 nan
1 1 1 0 nan
1 2 0 0 1
1 3 1 0 2
1 4 0 0 1
1 0 0 1 nan
1 1 1 1 nan
1 2 0 1 1
1 3 1 1 2
1 4 0 1 1
So diff_b is a time-difference variable: the duration between the current time point and the most recent earlier time point with an action (k = 1). If there has been no earlier action, it is nan. diff_b is calculated independently for each user a and, within the same user, independently for each product c.
Thank you.
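You just need to add c to the groupby keys in the second step:
df['New'] = df.b.loc[df.k == 1]  # keep b only where k equals 1; NaN elsewhere
# forward-fill within each (a, c) group, then shift so each row sees the previous action time;
# transform (rather than apply) keeps the original row index, so the assignment aligns
df['New'] = df.groupby(['a', 'c'])['New'].transform(lambda x: x.ffill().shift())
df.b - df['New']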
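A self-contained sketch of this approach on the question's sample data. It uses groupby(...).transform rather than apply (a swapped-in variant, chosen so the result keeps the original row index), and `prev_action` is a helper column name introduced here for illustration:

```python
import pandas as pd

# Rebuild the question's table: user a, time b, flag k, product c.
rows = []
for a, flags in [(0, [0, 1, 0, 0, 1, 0]), (1, [0, 1, 0, 1, 0])]:
    for c in (0, 1):
        for b, k in enumerate(flags):
            rows.append({"a": a, "b": b, "k": k, "c": c})
df = pd.DataFrame(rows)

# Keep b only on rows where an action happened (k == 1); NaN elsewhere.
df["prev_action"] = df["b"].where(df["k"] == 1)

# Within each (a, c) group, carry the last action time forward, then shift by
# one row so that a row never counts its own action.
df["prev_action"] = df.groupby(["a", "c"])["prev_action"].transform(
    lambda s: s.ffill().shift()
)

df["diff_b"] = df["b"] - df["prev_action"]
print(df.drop(columns="prev_action"))
```

Rows before the first action in a group keep NaN, which matches the nan entries in the desired output.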

How do I open a binary matrix and convert it into a 2D array or a dataframe?

I have a binary matrix in a txt file that looks as follows:
0011011000
1011011000
0011011000
0011011010
1011011000
1011011000
0011011000
1011011000
0100100101
1011011000
I want to turn this into a 2D array or a dataframe with one digit per column and the rows as shown. I've tried using numpy and pandas, but the output has only one column that contains the whole number. I want to be able to access an entire column of digits.
One of the things I've tried is:
with open("a1data1.txt") as myfile:
    dat1=myfile.read().split('\n')
dat1=pd.DataFrame(dat1)
Use read_fwf with parameter widths:
df = pd.read_fwf("a1data1.txt", header=None, widths=[1]*10)
print (df)
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
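The same call can be tried without a file on disk. The sketch below feeds three of the question's rows through an in-memory buffer; the `text` sample and the use of io.StringIO are assumptions made here purely for demonstration:

```python
import io

import pandas as pd

# Three rows from the question, standing in for the contents of a1data1.txt.
text = "0011011000\n1011011000\n0100100101\n"

# widths=[1] * 10 tells read_fwf that every field is exactly one character wide.
df = pd.read_fwf(io.StringIO(text), header=None, widths=[1] * 10)

print(df)
print(df[2].tolist())  # a whole column of digits
```

The list passed to widths has to match the line length, so a file with a different number of digits per row needs a different multiplier.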
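If you have already read the file into a single-column DataFrame of strings (as in the question's attempt), you can split each string into one character per column. Note that the resulting cells are strings ('0' and '1'), not integers: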
pd.DataFrame(df[0].apply(list).values.tolist())
Out[846]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1 0 0 0
2 0 0 1 1 0 1 1 0 0 0
3 0 0 1 1 0 1 1 0 1 0
4 1 0 1 1 0 1 1 0 0 0
5 1 0 1 1 0 1 1 0 0 0
6 0 0 1 1 0 1 1 0 0 0
7 1 0 1 1 0 1 1 0 0 0
8 0 1 0 0 1 0 0 1 0 1
9 1 0 1 1 0 1 1 0 0 0
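A minimal sketch of this string-splitting route with an added integer conversion. The `lines` list stands in for the file contents, and `raw` and `wide` are names introduced here for illustration:

```python
import pandas as pd

# Three rows from the question, already read as strings (leading zeros preserved).
lines = ["0011011000", "1011011000", "0100100101"]
raw = pd.DataFrame(lines)  # a single column, labelled 0

# list("0011...") splits a string into its characters; astype(int) turns them into numbers.
wide = pd.DataFrame(raw[0].apply(list).tolist()).astype(int)

print(wide)
```

Reading the lines as strings matters here: if the values were parsed as integers first, the leading zeros would be lost.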

for loop to extract header for a dataframe in pandas

I am a newbie in python. I have a data frame that looks like this:
A B C D E
0 1 0 1 0 1
1 0 1 0 0 1
2 0 1 1 1 0
3 1 0 0 1 0
4 1 0 0 1 1
How can I write a for loop to gather the column names for each row? I expect my result to look like this:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
Can anyone help me with that? Thank you!
The dot function is made for this purpose: you want the matrix product of your 0/1 matrix and the vector of column names, which concatenates the names of the columns that hold a 1 in each row:
df.dot(df.columns)
Out[5]:
0 ACE
1 BE
2 BCD
3 AD
4 ADE
If your DataFrame holds numeric values other than 0 and 1, first obtain the boolean matrix by testing df against 0:
(df!=0).dot(df.columns)
PS: Just assign the result to the new column:
df['Result'] = df.dot(df.columns)
df
Out[7]:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
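For comparison, a more explicit sketch that builds the same Result column by joining the selected column names row by row. This is an alternative to dot, written out here for readability, and `mask` is a name introduced for illustration:

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    [
        [1, 0, 1, 0, 1],
        [0, 1, 0, 0, 1],
        [0, 1, 1, 1, 0],
        [1, 0, 0, 1, 0],
        [1, 0, 0, 1, 1],
    ],
    columns=list("ABCDE"),
)

# Boolean matrix: True wherever the cell is non-zero.
mask = df.ne(0).to_numpy()

# For each row, pick the column names under a True and concatenate them.
df["Result"] = ["".join(df.columns[row]) for row in mask]
print(df)
```

The mask is computed before the new column is added, so Result itself never takes part in the selection.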
