I have a data set in excel. A sample of the data is given below. Each row contains a number of items; one item in each column. The data has no headers either.
a b a d
g z f d a
e
dd gg dd g f r t
I want to create a table like the one below. It should count the items in each row and display the counts per row. I don't know a priori how many distinct items are in the table.
row# a b d g z f e dd gg r t
1 2 1 1 0 0 0 0 0 0 0 0
2 1 0 1 1 1 1 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 2 1 1 1
I am not an expert in python and any assistance is very much appreciated.
Use get_dummies + sum:
import pandas as pd
df = pd.read_csv(file, names=range(100)).stack() # setup to account for missing values (rows of unequal length)
df.str.get_dummies().groupby(level=0).sum() # sum(level=0) was removed in pandas 2.0
a b d dd e f g gg r t z
0 2 1 1 0 0 0 0 0 0 0 0
1 1 0 1 0 0 1 1 0 0 0 1
2 0 0 0 0 1 0 0 0 0 0 0
3 0 0 0 2 0 1 1 1 1 1 0
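The same idea can be tried without a file. This is a minimal sketch, assuming the sample rows are inlined as whitespace-separated strings (the `rows` list is an illustration, not the original spreadsheet):

```python
import pandas as pd

# Sample rows from the question, inlined as whitespace-separated strings
# (an assumption made so the sketch runs without the original file).
rows = ["a b a d", "g z f d a", "e", "dd gg dd g f r t"]

# One item per entry; the index keeps the original row number.
items = pd.Series(rows).str.split().explode()

# One indicator column per distinct item, then add the indicators up per row.
counts = items.str.get_dummies().groupby(level=0).sum()
print(counts)
```

The columns come out in sorted order rather than in order of first appearance, as in the output above.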
My apologies SO community, I am a newbie on the platform and in the pursuit of making this question precise and straight to the point, I didn't give relevant info.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
'alpha': ['A','B,C,D,A','B,C,A'],
'beta': ['1|20|30','350','376']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
It looks like this:
  user_id    alpha     beta
0     abc        A  1|20|30
1     def  B,C,D,A      350
2     ghi    B,C,A      376
I want something like this:
  user_id    alpha     beta  a_A  a_B  a_C  a_D  b_1  b_20  b_30  b_350  b_376
0     abc        A  1|20|30    1    0    0    0    1     1     1      0      0
1     def  B,C,D,A      350    1    1    1    1    0     0     0      1      0
2     ghi    B,C,A      376    1    1    1    0    0     0     0      0      1
My original data contains 11K rows, and there are around 550 distinct values across alpha and beta.
I created a list of all the values in the alpha and beta columns and applied pd.get_dummies, but that results in many rows, like the output displayed by #wwwnde. I would like all the rows to be rolled up by user_id.
CountVectorizer uses a similar idea on documents: it creates a column for every word in the sentences and counts each word's frequency. However, I am guessing pandas has a better, more efficient way to do that.
Grateful for all your assistance. :)
Desired Output
You will have to achieve that in a series of steps.
Sample Data
id ALPHA BETA
0 1 A 1|20|30
1 2 B,C,D,A 350
2 3 B,C,A 395|45|90
Create Lists for values in ALPHA and BETA
df.BETA = df.BETA.str.split('|')   # '1|20|30' -> ['1', '20', '30']
df.ALPHA = df.ALPHA.str.split(',') # 'B,C,A' -> ['B', 'C', 'A']
Explode the list elements into individual rows
df=df.explode('ALPHA')
df=df.explode('BETA')
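Extract the indicator columns using get_dummies.
df = pd.get_dummies(df, dtype=int) # assign the result; dtype=int keeps 0/1 rather than booleans
Strip the prefix from the column names
df.columns = df.columns.str.replace('ALPHA_|BETA_', '', regex=True) # the pattern is a regex, so pass regex=True explicitly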
id A B C D 1 20 30 350 395 45 90
0 1 1 0 0 0 1 0 0 0 0 0 0
0 1 1 0 0 0 0 1 0 0 0 0 0
0 1 1 0 0 0 0 0 1 0 0 0 0
1 2 0 1 0 0 0 0 0 1 0 0 0
1 2 0 0 1 0 0 0 0 1 0 0 0
1 2 0 0 0 1 0 0 0 1 0 0 0
1 2 1 0 0 0 0 0 0 1 0 0 0
2 3 0 1 0 0 0 0 0 0 1 0 0
2 3 0 1 0 0 0 0 0 0 0 1 0
2 3 0 1 0 0 0 0 0 0 0 0 1
2 3 0 0 1 0 0 0 0 0 1 0 0
2 3 0 0 1 0 0 0 0 0 0 1 0
2 3 0 0 1 0 0 0 0 0 0 0 1
2 3 1 0 0 0 0 0 0 0 1 0 0
2 3 1 0 0 0 0 0 0 0 0 1 0
2 3 1 0 0 0 0 0 0 0 0 0 1
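The exploded frame above has one row per ALPHA/BETA combination, while the question asks for one row per user. A minimal sketch of one way to get there directly, assuming the question's printed sample and its `a_`/`b_` prefixes, is to let `Series.str.get_dummies` split each delimited column itself:

```python
import pandas as pd

# Sample frame as printed in the question.
df = pd.DataFrame({
    "user_id": ["abc", "def", "ghi"],
    "alpha": ["A", "B,C,D,A", "B,C,A"],
    "beta": ["1|20|30", "350", "376"],
})

# str.get_dummies splits on the separator and returns one indicator
# column per distinct value, still with one row per original row.
alpha = df["alpha"].str.get_dummies(sep=",").add_prefix("a_")
beta = df["beta"].str.get_dummies(sep="|").add_prefix("b_")

# Attach the indicator columns next to the original ones.
result = pd.concat([df, alpha, beta], axis=1)
print(result)
```

Because nothing is exploded, no roll-up by user_id is needed afterwards. This assumes each user_id appears on a single row, as in the sample.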
In my pandas DataFrame, the one-hot encoded vectors are stored as columns, e.g.:
Rows A B C D E
0 0 0 0 1 0
1 0 0 1 0 0
2 0 1 0 0 0
3 0 0 0 1 0
4 1 0 0 0 0
5 0 0 0 0 1
How can I convert these columns into a single DataFrame column by label encoding them in Python? For example:
Rows A
0 4
1 3
2 2
3 4
4 1
5 5
I also need a suggestion on how to handle rows that have multiple 1s, because each row can have only one category at a time.
Try with argmax
#df=df.set_index('Rows')
df['New']=df.values.argmax(1)+1
df
Out[231]:
A B C D E New
Rows
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
argmax is the way to go; here is another way using idxmax and get_indexer:
df['New'] = df.columns.get_indexer(df.idxmax(1))+1
#df.idxmax(1).map(df.columns.get_loc)+1
print(df)
Rows A B C D E New
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
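To see that the positional and the label-based routes agree, here is a small self-contained sketch. The frame is rebuilt from the question's sample, and the names `by_position` and `by_label` are only illustrative:

```python
import pandas as pd

# One-hot sample from the question, one row per observation.
df = pd.DataFrame(
    [[0, 0, 0, 1, 0],
     [0, 0, 1, 0, 0],
     [0, 1, 0, 0, 0],
     [0, 0, 0, 1, 0],
     [1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1]],
    columns=list("ABCDE"),
)

# Position of the largest value in each row, shifted to 1-based labels.
by_position = df.to_numpy().argmax(axis=1) + 1

# Column label of the largest value, mapped back to its 1-based position.
by_label = df.columns.get_indexer(df.idxmax(axis=1)) + 1

print(by_position, by_label)
```

Both routes return the first maximum in a row, so a row with several 1s gets its leftmost column and a row of all zeros is labelled 1.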
Also need suggestion on this that some rows have multiple 1s, how to
handle those rows because we can have only one category at a time.
In this case you dot your DataFrame of dummies with an array of all the powers of 2 (based on the number of columns). This ensures that the presence of any unique combination of dummies (A, A+B, A+B+C, B+C, ...) will have a unique category label. (Added a few rows at the bottom to illustrate the unique counting)
import numpy as np
df['Category'] = df.dot(2**np.arange(df.shape[1])) # weights 1, 2, 4, 8, 16 for columns A..E
A B C D E Category
Rows
0 0 0 0 1 0 8
1 0 0 1 0 0 4
2 0 1 0 0 0 2
3 0 0 0 1 0 8
4 1 0 0 0 0 1
5 0 0 0 0 1 16
6 1 0 0 0 1 17
7 0 1 0 0 1 18
8 1 1 0 0 1 19
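A runnable sketch of the powers-of-2 encoding follows, together with a hypothetical `decode` helper (not part of the answer above) showing that each label maps back to exactly one combination of columns:

```python
import numpy as np
import pandas as pd

# A few dummy rows, including combinations of several 1s.
dummies = pd.DataFrame(
    [[1, 0, 0, 0, 0],
     [0, 0, 0, 0, 1],
     [1, 0, 0, 0, 1],
     [1, 1, 0, 0, 1]],
    columns=list("ABCDE"),
)

# Column i gets weight 2**i, so every combination has a distinct sum.
weights = 2 ** np.arange(dummies.shape[1])
codes = dummies.dot(weights)


def decode(code, columns):
    """Hypothetical helper: list the columns whose bit is set in code."""
    return [col for i, col in enumerate(columns) if (int(code) >> i) & 1]


print(codes.tolist())
print(decode(codes.iloc[3], dummies.columns))
```

The labels are unique but not consecutive, which matters if a downstream step expects categories numbered 1..k.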
Another readable solution, in addition to the other great answers, that works for any dtype in your DataFrame, provided every row contains exactly one nonzero value (otherwise the lengths will not match):
df['variables'] = np.where(df.values)[1]+1
output:
A B C D E variables
0 0 0 0 1 0 4
1 0 0 1 0 0 3
2 0 1 0 0 0 2
3 0 0 0 1 0 4
4 1 0 0 0 0 1
5 0 0 0 0 1 5
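Since `np.where` returns one entry per nonzero cell rather than one per row, it may help to check the rows first. A minimal sketch, assuming that rows with zero or several 1s should be left as NaN (that policy is an assumption, not something the answers above specify):

```python
import numpy as np
import pandas as pd

# Second row has two 1s, so it has no single category.
df = pd.DataFrame(
    [[0, 0, 0, 1, 0],
     [1, 0, 0, 0, 1]],
    columns=list("ABCDE"),
)

# True for rows that contain exactly one nonzero value.
exactly_one = (df != 0).sum(axis=1).eq(1)

# np.where yields one (row, column) pair per nonzero cell: 3 here, for 2 rows.
row_idx, col_idx = np.where(df.to_numpy())

# Label only the unambiguous rows; the others stay NaN.
labels = pd.Series(np.nan, index=df.index)
labels.loc[exactly_one] = np.where(df[exactly_one].to_numpy())[1] + 1
print(labels)
```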
Let's assume we have data like this (sorted by time), with a dummy column created for each class, in a PySpark DataFrame:
ID class e_TYPE_B e_TYPE_C e_TYPE_L e_TYPE_A e_TYPE_E e_TYPE_G
1 G 0 0 0 0 0 1
1 B 1 0 0 0 0 0
1 B 1 0 0 0 0 0
2 E 0 0 0 0 1 0
2 E 0 0 0 0 1 0
2 C 0 1 0 0 0 0
2 C 0 1 0 0 0 0
2 E 0 0 0 0 1 0
2 E 0 0 0 0 1 0
3 L 0 0 1 0 0 0
3 L 0 0 1 0 0 0
3 B 1 0 0 0 0 0
3 E 0 0 0 0 1 0
4 A 0 0 0 1 0 0
4 A 0 0 0 1 0 0
5 B 1 0 0 0 0 0
5 B 1 0 0 0 0 0
5 A 0 0 0 1 0 0
5 A 0 0 0 1 0 0
Now I am trying to find the count of IDs moving from one class to another. The two classes can be consecutive or may have other classes in between. The relations should be built for each class from top to bottom, per ID.
For Example,
ID 1 goes from G to B then 1 should be added to G to B counter,
ID 2 goes from E to C then 1 should be added to E to C counter,
ID 2 goes from C to E then 1 should be added to C to E counter,
ID 3 goes from L to B then 1 should be added to L to B counter,
ID 3 goes from B to E then 1 should be added to B to E counter,
Also ID 3 goes from L to E then 1 should be added to L to E counter,
ID 4 has only one class, so it should be discarded.
I thought of using a Window operation partitioned on ID, but I am struggling with how to iterate over each partition to count the class relations above.
Please provide a solution or code snippet for this.
Thanks.
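The counting rule in the examples can be sketched in plain Python before porting it to PySpark. This is not PySpark code, and it rests on two assumptions read off the examples: consecutive repeats of a class are collapsed, and each ordered pair of different classes counts at most once per ID.

```python
from collections import Counter
from itertools import combinations

# (ID, class) rows from the question, already sorted by time.
events = [
    (1, "G"), (1, "B"), (1, "B"),
    (2, "E"), (2, "E"), (2, "C"), (2, "C"), (2, "E"), (2, "E"),
    (3, "L"), (3, "L"), (3, "B"), (3, "E"),
    (4, "A"), (4, "A"),
    (5, "B"), (5, "B"), (5, "A"), (5, "A"),
]

# Collapse consecutive repeats so each ID keeps its sequence of class changes.
sequences = {}
for id_, cls in events:
    seq = sequences.setdefault(id_, [])
    if not seq or seq[-1] != cls:
        seq.append(cls)

# Every earlier -> later pair of different classes counts once per ID
# (assumption: the set removes repeats of the same pair within an ID).
transitions = Counter()
for seq in sequences.values():
    pairs = {(a, b) for a, b in combinations(seq, 2) if a != b}
    transitions.update(pairs)

print(transitions)
```

ID 4 contributes nothing because its collapsed sequence has a single class. ID 3 contributes L to B, B to E and L to E, matching the examples.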
Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is how to perform the "reverse pd.get_dummies()"?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish is given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how could I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
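A short round-trip sketch makes the "reverse get_dummies" reading concrete. It assumes every row has exactly one 1, which the `valid` check (an illustrative name) verifies:

```python
import pandas as pd

# Forward direction: categorical Series -> indicator columns.
s = pd.Series(list("ABCDAB"))
dummies = pd.get_dummies(s, dtype=int)

# Reverse direction: the label of the column holding the 1 in each row.
recovered = dummies.idxmax(axis=1)

# idxmax only inverts get_dummies when each row has exactly one 1.
valid = dummies.sum(axis=1).eq(1)

print(recovered.tolist(), valid.all())
```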
I am a newbie in python. I have a data frame that looks like this:
A B C D E
0 1 0 1 0 1
1 0 1 0 0 1
2 0 1 1 1 0
3 1 0 0 1 0
4 1 0 0 1 1
How can I write a for loop to gather the column names for each row? I expect my result set to look like this:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
Can anyone help me with that? Thank you!
The dot function fits this purpose, since what you want is the matrix product between your DataFrame and the vector of column names:
df.dot(df.columns)
Out[5]:
0 ACE
1 BE
2 BCD
3 AD
4 ADE
If your DataFrame is numeric, first obtain the boolean matrix by testing your df against 0:
(df!=0).dot(df.columns)
PS: Just assign the result to the new column
df['Result'] = df.dot(df.columns)
df
Out[7]:
A B C D E Result
0 1 0 1 0 1 ACE
1 0 1 0 0 1 BE
2 0 1 1 1 0 BCD
3 1 0 0 1 0 AD
4 1 0 0 1 1 ADE
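The trick works because multiplying a string by 1 (or True) keeps it and multiplying by 0 (or False) gives an empty string, so the dot product concatenates the names of the flagged columns. A self-contained sketch, with a comma-separated variant added as an illustration for column names longer than one character:

```python
import pandas as pd

# Sample frame from the question.
df = pd.DataFrame(
    [[1, 0, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [0, 1, 1, 1, 0],
     [1, 0, 0, 1, 0],
     [1, 0, 0, 1, 1]],
    columns=list("ABCDE"),
)

# Boolean flags, so values other than 0/1 cannot repeat a name.
flags = df.ne(0)

# Concatenate the names of the flagged columns in each row.
joined = flags.dot(df.columns)

# Same idea with a separator appended to each name, then trimmed at the end.
separated = flags.dot(df.columns + ",").str.rstrip(",")

print(joined.tolist())
print(separated.tolist())
```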