I have the dataframe below:
col
A a
A b
A c
B d
B e
C f
I would like to get dummy variables:
a b c d e f
A 1 1 1 0 0 0
B 0 0 0 1 1 0
C 0 0 0 0 0 1
How can I get this?
I tried:
df.col.get_dummies()
But I couldn't group by the index.
You need to group by the index and aggregate with max:
print(df.col.str.get_dummies().groupby(level=0).max())
a b c d e f
A 1 1 1 0 0 0
B 0 0 0 1 1 0
C 0 0 0 0 0 1
Or:
print(pd.get_dummies(df.col).groupby(level=0).max())
a b c d e f
A 1 1 1 0 0 0
B 0 0 0 1 1 0
C 0 0 0 0 0 1
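Putting the first approach together as a runnable sketch (the frame below reconstructs the question's data, with the repeated letters as the index):

```python
import pandas as pd

# reconstruct the question's frame: repeated index labels, one string column
df = pd.DataFrame({"col": list("abcdef")}, index=list("AAABBC"))

# one-hot encode the strings, then collapse duplicate index labels with max
out = df.col.str.get_dummies().groupby(level=0).max()
print(out)
```

`max` works here because within each index label every dummy column is either all zeros or contains a single 1.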
Related
I have this data for example:
   A  B  C  Class_label
0  0  1  1  B_C
1  1  1  1  A_B_C
2  0  0  1  C
How do I obtain the classified-label column, and also count the 1s in each row and display that, using a pandas DataFrame?
Use DataFrame.assign to add the new columns: DataFrame.dot with the column names builds the labels, and sum counts the 1s. Only the numeric columns are used, selected by DataFrame.select_dtypes:
df1 = df.select_dtypes(np.number)
df = df.assign(classifiedlabel=df1.dot(df1.columns + '_').str[:-1],
               countones=df1.sum(axis=1))
print(df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
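As a self-contained check, here is the same recipe on a small frame reconstructed from the printed output above (the column values are read off that table):

```python
import numpy as np
import pandas as pd

# frame reconstructed from the printed output above
df = pd.DataFrame({"A": [0, 1, 0, 0],
                   "B": [1, 1, 0, 1],
                   "C": [0, 0, 1, 1],
                   "D": [1, 1, 0, 0]})

df1 = df.select_dtypes(np.number)
# dot multiplies each 0/1 value by the column name: 1 * "B_" is "B_",
# 0 * "B_" is "", so the row-wise sum concatenates only the matching names
df = df.assign(classifiedlabel=df1.dot(df1.columns + "_").str[:-1],
               countones=df1.sum(axis=1))
print(df)
```

The trailing `_` is appended before the dot product so the labels come out separated, then `.str[:-1]` trims the last one.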
If the classifiedlabel column already exists, the simplest approach is to use sum alone (numeric_only=True skips the string column):
df["countones"] = df.sum(axis=1, numeric_only=True)
print(df)
A B C D classifiedlabel countones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
If the values are 1/0, you can use the following (_get_numeric_data is a private pandas helper; select_dtypes is the public equivalent):
df.assign(count=df._get_numeric_data().sum(axis=1))
Output:
A B C D classifiedlabel count
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
Try:
df["number_of_ones"] = (df == 1).astype(int).sum(axis=1)
print(df)
A B C D classifiedlabel number_of_ones
0 0 1 0 1 B_D 2
1 1 1 0 1 A_B_D 3
2 0 0 1 0 C 1
3 0 1 1 0 B_C 2
I have a dataset that looks like this:
df = pd.DataFrame(data= [[0,0,1],[1,0,0],[0,1,0]], columns = ['A','B','C'])
A B C
0 0 0 1
1 1 0 0
2 0 1 0
I want to create a new column that, in each row, holds the name of the column containing the 1:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Use dot:
df['value'] = df.values.dot(df.columns)
Output:
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
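The trick works because multiplying the 0/1 values by the string column names keeps only the matching name; a runnable version:

```python
import pandas as pd

df = pd.DataFrame(data=[[0, 0, 1], [1, 0, 0], [0, 1, 0]], columns=["A", "B", "C"])

# 1 * "C" is "C" and 0 * "C" is "", so each row's product-sum
# collapses to the name of the column holding the 1
df["value"] = df.values.dot(df.columns)
print(df)
```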
Using pd.DataFrame.idxmax:
df['value'] = df.idxmax(axis=1)
print(df)
A B C value
0 0 0 1 C
1 1 0 0 A
2 0 1 0 B
Let's assume we have data like this (sorted by time), with dummy columns created for the classes in a PySpark dataframe:
ID class e_TYPE_B e_TYPE_C e_TYPE_L e_TYPE_A e_TYPE_E e_TYPE_G
1 G 0 0 0 0 0 1
1 B 1 0 0 0 0 0
1 B 1 0 0 0 0 0
2 E 0 0 0 0 1 0
2 E 0 0 0 0 1 0
2 C 0 1 0 0 0 0
2 C 0 1 0 0 0 0
2 E 0 0 0 0 1 0
2 E 0 0 0 0 1 0
3 L 0 0 1 0 0 0
3 L 0 0 1 0 0 0
3 B 1 0 0 0 0 0
3 E 0 0 0 0 1 0
4 A 0 0 0 1 0 0
4 A 0 0 0 1 0 0
5 B 1 0 0 0 0 0
5 B 1 0 0 0 0 0
5 A 0 0 0 1 0 0
5 A 0 0 0 1 0 0
Now, I am trying to find the count of IDs moving from one class to another. The transition can be consecutive or have other classes in between. The relation should be built for each class from top to bottom, per ID.
For Example,
ID 1 goes from G to B then 1 should be added to G to B counter,
ID 2 goes from E to C then 1 should be added to E to C counter,
ID 2 goes from C to E then 1 should be added to C to E counter,
ID 3 goes from L to B then 1 should be added to L to B counter,
ID 3 goes from B to E then 1 should be added to B to E counter,
Also ID 3 goes from L to E then 1 should be added to L to E counter,
ID 4 has only one class, so it should be discarded.
I thought of using a Window operation partitioned on ID, but I am struggling with how to iterate over each partition to count the class transitions above.
Please provide a solution/code snippet for this.
Thanks.
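No answer follows in this thread, so here is a hedged sketch of the counting logic in plain pandas: the DataFrame is a reconstruction of the sample, consecutive duplicates are collapsed so that repeated rows count as one visit, and every ordered pair of distinct classes is counted, matching the examples above (including the non-consecutive L to E pair for ID 3). In PySpark the same idea could be expressed by collecting each ID's ordered classes with collect_list over a Window partitioned by ID, then applying the same pair counting in a UDF.

```python
from collections import Counter
from itertools import groupby

import pandas as pd

# reconstruction of the question's (ID, class) pairs, already sorted by time
df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 5, 5, 5, 5],
    "class": list("GBBEECCEELLBEAABBAA"),
})

counts = Counter()
for _, grp in df.groupby("ID"):
    # collapse consecutive duplicates: B, B is a single visit to B
    seq = [k for k, _ in groupby(grp["class"])]
    # count every ordered pair of distinct classes (consecutive or not);
    # IDs with a single class produce no pairs and drop out naturally
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            if seq[i] != seq[j]:
                counts[(seq[i], seq[j])] += 1

print(counts)
```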
I have a data set in Excel. A sample of the data is given below. Each row contains a number of items, one item per column. The data has no headers either.
a b a d
g z f d a
e
dd gg dd g f r t
I want to create a table like the one below. It should count the items in each row and display the count per row. I don't know a priori how many items are in the table.
row# a b d g z f e dd gg r t
1 2 1 1 0 0 0 0 0 0 0 0
2 1 0 1 1 1 1 0 0 0 0 0
3 0 0 0 0 0 0 1 0 0 0 0
4 0 0 0 1 0 1 0 2 1 1 1
I am not an expert in Python, and any assistance is much appreciated.
Use get_dummies + a grouped sum:
df = pd.read_csv(file, sep=r'\s+', names=range(100)).stack()  # wide name range accounts for rows of different lengths
df.str.get_dummies().groupby(level=0).sum()  # the older sum(level=0) form was removed in pandas 2.0
a b d dd e f g gg r t z
0 2 1 1 0 0 0 0 0 0 0 0
1 1 0 1 0 0 1 1 0 0 0 1
2 0 0 0 0 1 0 0 0 0 0 0
3 0 0 0 2 0 1 1 1 1 1 0
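A self-contained version of the same recipe, with the question's sample embedded via StringIO (the column count of 10 is an arbitrary upper bound on row length):

```python
from io import StringIO

import pandas as pd

data = """a b a d
g z f d a
e
dd gg dd g f r t"""

# read with more columns than any row needs, then stack into one long Series;
# the NaN padding contributes nothing to the dummy counts
s = pd.read_csv(StringIO(data), sep=r"\s+", names=range(10), engine="python").stack()
counts = s.str.get_dummies().groupby(level=0).sum()
print(counts)
```

Grouping on level 0 of the stacked MultiIndex sums the dummies back per original row, so repeated items (like the two `a`s in row 0) count correctly.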
Given a pandas DataFrame, how does one convert several binary columns (where 1 denotes the value exists, 0 denotes it doesn't) into a single categorical column?
Another way to think of this is how to perform the "reverse pd.get_dummies()"?
Here is an example of converting a categorical column into several binary columns:
import pandas as pd
s = pd.Series(list('ABCDAB'))
df = pd.get_dummies(s)
df
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
What I would like to accomplish: given a dataframe
df1
A B C D
0 1 0 0 0
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
5 0 1 0 0
how do I convert it into
df1
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
One way would be to use idxmax to find the 1s:
In [32]: df["category"] = df.idxmax(axis=1)
In [33]: df
Out[33]:
A B C D category
0 1 0 0 0 A
1 0 1 0 0 B
2 0 0 1 0 C
3 0 0 0 1 D
4 1 0 0 0 A
5 0 1 0 0 B
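End to end, starting from the dummies built at the top of the question (note that newer pandas versions return boolean dummy columns, which idxmax handles the same way since True is the row maximum):

```python
import pandas as pd

s = pd.Series(list("ABCDAB"))
df = pd.get_dummies(s)

# idxmax(axis=1) returns the column label of the first row-wise maximum,
# which for one-hot rows is exactly the column holding the 1
df["category"] = df[["A", "B", "C", "D"]].idxmax(axis=1)
print(df)
```

One caveat: if a row is all zeros, idxmax still returns the first column label rather than a missing value, so all-zero rows need separate handling.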