Finding columns which are unique to a row in Pandas dataframe

I have a dataframe of the below structure. I want to get the column numbers which are unique to a particular row.
1 1 0 1 1 1 0 0 0
0 1 0 1 0 0 0 0 0
0 1 0 0 1 0 0 0 0
1 0 0 0 1 0 0 0 1
0 0 0 0 0 0 1 1 0
1 0 0 0 1 0 0 0 0
In the above example I should get coln6, coln7, coln8, coln9 (as there is only one row which has a value specific to these columns). Also I should be able to distinguish among the columns like coln7 and coln8 should group together as they are unique to the same row. Is there an efficient solution in Python for this?

You can call sum on the df and compare against 1 and use this to mask the columns:
In [19]:
df.columns[df.sum(axis=0) == 1]
Out[19]:
Int64Index([5, 6, 7, 8], dtype='int64')
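The question also asks to group the unique columns by the row they belong to. A minimal sketch of one way to do that, assuming the frame contains only 0s and 1s (the names unique_cols, owner_row and groups are illustrative, not from the original answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([
    1, 1, 0, 1, 1, 1, 0, 0, 0,
    0, 1, 0, 1, 0, 0, 0, 0, 0,
    0, 1, 0, 0, 1, 0, 0, 0, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 1,
    0, 0, 0, 0, 0, 0, 1, 1, 0,
    1, 0, 0, 0, 1, 0, 0, 0, 0]).reshape(6, 9))

# Columns in which exactly one row holds a 1 (assumes 0/1 data only).
unique_cols = df.columns[df.sum(axis=0) == 1]

# For each such column, idxmax gives the label of the row holding its single 1.
owner_row = df[unique_cols].idxmax(axis=0)

# Collect the column labels under the row that owns them.
groups = {}
for col, row in owner_row.items():
    groups.setdefault(int(row), []).append(int(col))

print(groups)
```

Here columns 6 and 7 end up under the same key because both are set only in row 4.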

Here is my first approach:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.array([
1,1,0,1,1,1,0,0,0,
0,1,0,1,0,0,0,0,0,
0,1,0,0,1,0,0,0,0,
1,0,0,0,1,0,0,0,1,
0,0,0,0,0,0,1,1,0,
1,0,0,0,1,0,0,0,0]).reshape(6,9))
print(df.sum(axis=0) == 1)  # True where exactly one row has a 1 in that column
Output:
0 False
1 False
2 False
3 False
4 False
5 True
6 True
7 True
8 True
dtype: bool

Related

How to pivot dataframe into ML format

My head is spinning trying to figure out if I have to use pivot_table, melt, or some other function.
I have a DF that looks like this:
month day week_day classname_en origin destination
0 1 7 2 1 2 5
1 1 2 6 2 1 167
2 2 1 5 1 2 54
3 2 2 6 4 1 6
4 1 2 6 5 6 1
But I want to turn it into something like:
month_1 month_2 ...classname_en_1 classname_en_2 ... origin_1 origin_2 ...destination_1
0 1 0 1 0 0 1 0
1 1 0 0 1 1 0 0
2 0 1 1 0 0 1 0
3 0 1 0 0 1 0 0
4 1 0 0 0 0 0 1
Basically, turn all values into columns and then have binary rows 1 - if the column is present, 0 if none.
IDK if it is at all possible to do with like a single function or not, but would appreciate all and any help!

To expand on @Corralien's answer:
It is indeed a way to do it, but since you are doing this for ML purposes, you might introduce a bug.
With pd.get_dummies you get a matrix with 20 features. Now, say you want to predict on data that contains a month not present in your training data: the matrix for your prediction data would then have 21 features, so you cannot pass it to your fitted model.
To overcome this you can use OneHotEncoder from scikit-learn. It makes sure that you always have the same number of features on new data as on your training data. The example below first shows the problem with get_dummies:
import pandas as pd
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
pd.get_dummies(df_train)
# output
age color_blue color_red
0 10 0 1
1 15 1 0
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
pd.get_dummies(df_new)
#output
age color_blue color_green color_red
0 10 0 0 1
1 15 1 0 0
2 20 0 1 0
As you can see, the new data has an extra column, and the position of color_red has changed as well.
If we use OneHotEncoder instead, we can avoid these issues:
from sklearn.preprocessing import OneHotEncoder
df_train = pd.DataFrame({"color":["red","blue"],"age":[10,15]})
ohe = OneHotEncoder(handle_unknown="ignore")
color_ohe_transformed= ohe.fit_transform(df_train[["color"]]) #creates sparse matrix
ohe_features = ohe.get_feature_names_out() # [color_blue, color_red]
pd.DataFrame(color_ohe_transformed.todense(),columns = ohe_features, dtype=int)
# output
color_blue color_red
0 0 1
1 1 0
# now transform new data
df_new = pd.DataFrame({"color":["red","blue","green"],"age":[10,15,20]})
new_data_ohe_transformed = ohe.transform(df_new[["color"]])
pd.DataFrame(new_data_ohe_transformed.todense(), columns=ohe_features, dtype=int)
#output
color_blue color_red
0 0 1
1 1 0
2 0 0
Note that in the last row both blue and red are zero, since it has color="green", which was not present in the training data.
Note that todense() is only used here to illustrate how it works. Usually you would keep the result as a sparse matrix and use e.g. scipy.sparse.hstack to append your other features, such as age, to it.
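If you would rather stay within pandas, one possible workaround for the mismatched-columns problem is to reindex the dummies of the new data onto the training columns. This is a different technique from OneHotEncoder, shown only as a sketch; train_dummies and new_dummies are illustrative names:

```python
import pandas as pd

df_train = pd.DataFrame({"color": ["red", "blue"], "age": [10, 15]})
df_new = pd.DataFrame({"color": ["red", "blue", "green"], "age": [10, 15, 20]})

# dtype=int keeps the indicator columns as 0/1 integers.
train_dummies = pd.get_dummies(df_train, dtype=int)

# Reindex onto the training columns: categories unseen in training
# (color_green) are dropped, and any training column missing from the
# new data would be added and filled with 0.
new_dummies = pd.get_dummies(df_new, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)
print(new_dummies)
```

As with handle_unknown="ignore", the "green" row ends up with zeros in every color column.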

Use pd.get_dummies:
out = pd.get_dummies(df, columns=df.columns)
print(out)
# Output
month_1 month_2 day_1 day_2 day_7 week_day_2 week_day_5 ... origin_2 origin_6 destination_1 destination_5 destination_6 destination_54 destination_167
0 1 0 0 0 1 1 0 ... 1 0 0 1 0 0 0
1 1 0 0 1 0 0 0 ... 0 0 0 0 0 0 1
2 0 1 1 0 0 0 1 ... 1 0 0 0 0 1 0
3 0 1 0 1 0 0 0 ... 0 0 0 0 1 0 0
4 1 0 0 1 0 0 0 ... 0 1 1 0 0 0 0
[5 rows x 20 columns]

You can use the pandas get_dummies function to turn each distinct value into its own indicator column.
The code would be:
import pandas as pd
df = pd.DataFrame({
'month': [1, 1, 2, 2, 1],
'day': [7, 2, 1, 2, 2],
'week_day': [2, 6, 5, 6, 6],
'classname_en': [1, 2, 1, 4, 5],
'origin': [2, 1, 2, 1, 6],
'destination': [5, 167, 54, 6, 1]
})
response = pd.get_dummies(df, columns=df.columns)
print(response)

Parsing values to specific columns in Pandas

I would like to use Pandas to parse Q26 Challenges into the subsequent columns, with a "1" representing its presence in the original unparsed column. So the data frame initially looks like this:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
1 5 0 0 0 0 0 0 0
2 1,2 0 0 0 0 0 0 0
3 1,3,7 0 0 0 0 0 0 0
And I want it to look like this:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
1 5 0 0 0 0 1 0 0
2 1,2 1 1 0 0 0 0 0
3 1,3,7 1 0 1 0 0 0 1

You can iterate over the range of values in Q26 Challenges, using str.contains to check if the current value is contained in the string and then converting that boolean value to an integer. For example:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5], 'Q26 Challenges': ['0', '1,2', '2', '1,2,6,7', '3,4,5,11']})
for i in range(1, 12):
    # \b word boundaries make sure e.g. 1 does not match inside 11
    df[f'Q26_{i}'] = df['Q26 Challenges'].str.contains(rf'\b{i}\b').astype(int)
df
Output:
id Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7 Q26_8 Q26_9 Q26_10 Q26_11
0 1 0 0 0 0 0 0 0 0 0 0 0 0
1 2 1,2 1 1 0 0 0 0 0 0 0 0 0
2 3 2 0 1 0 0 0 0 0 0 0 0 0
3 4 1,2,6,7 1 1 0 0 0 1 1 0 0 0 0
4 5 3,4,5,11 0 0 1 1 1 0 0 0 0 0 1
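To see why the \b word boundaries matter in the pattern, here is a small check on the last row's value. It is only an illustration; the variable names are made up for this sketch:

```python
import pandas as pd

s = pd.Series(['3,4,5,11'])

# A plain substring search finds '1' inside '11'.
without_boundary = s.str.contains('1')

# With word boundaries, '1' has to be a whole token between the commas.
with_boundary = s.str.contains(r'\b1\b')

print(without_boundary.iloc[0], with_boundary.iloc[0])
```

Without the boundaries, Q26_1 would wrongly be set for any row containing 10, 11, 21 and so on.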

str.get_dummies can be used on the 'Q26 Challenges' column to create the indicator values. This indicator DataFrame can be reindexed to include the complete result range (note column headers will be of type string). add_prefix can be used to add the 'Q26_' to the column headers. Lastly, join back to the original DataFrame:
df = df.join(
df['Q26 Challenges'].str.get_dummies(sep=',')
.reindex(columns=map(str, range(1, 8)), fill_value=0)
.add_prefix('Q26_')
)
The reindexing can also be done dynamically based on the resulting columns. It is necessary to convert the resulting column headers to numbers first to ensure numeric order, rather than lexicographic ordering:
s = df['Q26 Challenges'].str.get_dummies(sep=',')
# Convert to numbers to correctly access min and max
s.columns = s.columns.astype(int)
# Add back to DataFrame
df = df.join(s.reindex(
# Build range from the min column to max column values
columns=range(min(s.columns), max(s.columns) + 1),
fill_value=0
).add_prefix('Q26_'))
Both options produce:
ID Q26 Challenges Q26_1 Q26_2 Q26_3 Q26_4 Q26_5 Q26_6 Q26_7
0 1 5 0 0 0 0 1 0 0
1 2 1,2 1 1 0 0 0 0 0
2 3 1,3,7 1 0 1 0 0 0 1
Given initial input:
import pandas as pd
df = pd.DataFrame({
'ID': [1, 2, 3],
'Q26 Challenges': ['5', '1,2', '1,3,7']
})
ID Q26 Challenges
0 1 5
1 2 1,2
2 3 1,3,7

Python - Column-wise keep first unique value

I have a dataframe that has multiple columns that represent whether or not something had existed, but they are ordinal in nature. Something could have existed in all 3 categories, but I only want to indicate the highest level that it existed in.
So for a given row, I only want a single '1' value, but I want it to be kept at the highest level it was found at.
For this row:
1,1,0 , I would want the row to be changed to 1,0,0
and this row:
0,1,1 , I would want the row to be changed to 0,1,0
Here is a sample of what the data could look like, and expected output:
import pandas as pd
#input data
df = pd.DataFrame({'id':[1,2,3,4,5],
'level1':[0,0,0,0,1],
'level2':[1,0,1,0,1],
'level3':[0,1,1,1,0]})
#expected output:
new_df = pd.DataFrame({'id':[1,2,3,4,5],
'level1':[0,0,0,0,1],
'level2':[1,0,1,0,0],
'level3':[0,1,0,1,0]})

Using numpy.zeros and filling via numpy.argmax:
import numpy as np
out = np.zeros(df.iloc[:, 1:].shape, dtype=int)
out[np.arange(len(out)), np.argmax(df.iloc[:, 1:].values, 1)] = 1
df.iloc[:, 1:] = out
Using broadcasting with argmax:
a = df.iloc[:, 1:].values
df.iloc[:, 1:] = (a.argmax(axis=1)[:,None] == range(a.shape[1])).astype(int)
Both produce:
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
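One thing to watch with the argmax approaches: for a row that contains no 1 at all, argmax returns 0, so the first level would be switched on. The sample data has no such row, so the following is only a sketch of a possible guard on a made-up array:

```python
import numpy as np

a = np.array([[0, 1, 1],
              [0, 0, 0]])  # second row has no 1 at all (hypothetical case)

first = a.argmax(axis=1)  # returns 0 for the all-zero row
one_hot = (first[:, None] == np.arange(a.shape[1])).astype(int)

# Zero out the rows that had no 1 to begin with.
one_hot = one_hot * a.any(axis=1)[:, None]
print(one_hot)
```

The multiplication by a.any(axis=1) leaves normal rows untouched and clears the spurious 1 in all-zero rows.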

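You can use advanced indexing with NumPy. Updating the underlying NumPy array works here because the dataframe has a single int dtype. Note that this relies on df.values returning a writable view of the data, which pandas does not guarantee (for example, it does not hold when Copy-on-Write is enabled).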
idx = df.iloc[:, 1:].eq(1).values.argmax(1)
df.iloc[:, 1:] = 0
df.values[np.arange(df.shape[0]), idx+1] = 1
print(df)
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0

numpy.eye
v = df.iloc[:, 1:].values
i = np.eye(3, dtype=np.int64)
a = v.argmax(1)
df.iloc[:, 1:] = i[a]
df
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0
cumsum and mask
df.set_index('id').pipe(
lambda d: d.mask(d.cumsum(1) > 1, 0)
).reset_index()
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0

You can use get_dummies() on the label of the first maximum in each row (idxmax):
df[df.filter(like='level').columns] = pd.get_dummies(df.filter(like='level').idxmax(1))
id level1 level2 level3
0 1 0 1 0
1 2 0 0 1
2 3 0 1 0
3 4 0 0 1
4 5 1 0 0

Creating a dataframe with binary valued columns with pandas using values from an existing dataframe

I am trying to create a new dataframe with binary (0 or 1) values from an existing dataframe. For every row in the given dataframe, the program should take the value from each cell and set 1 in the corresponding column of the same-numbered row in the new dataframe.
I have tried executing the following code snippet.
for col in products:
    index = 0
    for item in products.loc[col]:
        # .ix was removed in pandas 1.0; .loc is the label-based replacement
        products_coded.loc[index, 'prod_' + str(item)] = 1
        index = index + 1
It works for a small number of rows, but it takes a lot of time on any large dataset. What would be the best way to get the desired outcome?

I think you need to:
first use get_dummies after casting the values to strings
aggregate with max over columns that share the same name
convert the column names to int for correct ordering
reindex to order the columns and add the missing ones, filling NaNs with 0 via fill_value=0 and dropping the 0 column
use add_prefix to rename the columns
df = pd.DataFrame({'B':[3,1,12,12,8],
'C':[0,6,0,14,0],
'D':[0,14,0,0,0]})
print (df)
B C D
0 3 0 0
1 1 6 14
2 12 0 0
3 12 14 0
4 8 0 0
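df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         # max over columns sharing a name (DataFrame.max(level=...) was removed in pandas 2.0)
         .T.groupby(level=0).max().T
         .astype(int)
         .rename(columns=lambda x: int(x))
         .reindex(columns=range(1, df.values.max() + 1), fill_value=0)
         .add_prefix('prod_'))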
print (df1)
prod_1 prod_2 prod_3 prod_4 prod_5 prod_6 prod_7 prod_8 prod_9 \
0 0 0 1 0 0 0 0 0 0
1 1 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0
prod_10 prod_11 prod_12 prod_13 prod_14
0 0 0 0 0 0
1 0 0 0 0 1
2 0 0 1 0 0
3 0 0 1 0 1
4 0 0 0 0 0
Another similar solution:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
         .T.groupby(level=0).max().T  # replaces .max(level=0, axis=1), removed in pandas 2.0
         .astype(int))
df1.columns = df1.columns.astype(int)
df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
          .add_prefix('prod_'))
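For comparison, the same indicator table can also be sketched directly with NumPy indexing. This is a different technique from the get_dummies chain, and it assumes every cell is a non-negative integer with 0 meaning "no product"; the names vals, out and result are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B': [3, 1, 12, 12, 8],
                   'C': [0, 6, 0, 14, 0],
                   'D': [0, 14, 0, 0, 0]})

vals = df.to_numpy()
n = vals.max()

# One column per possible value 0..n; column 0 is dropped at the end.
out = np.zeros((len(df), n + 1), dtype=int)

# Row number of every cell, in the same row-major order as vals.ravel().
rows = np.repeat(np.arange(len(df)), vals.shape[1])
out[rows, vals.ravel()] = 1

result = pd.DataFrame(out[:, 1:], index=df.index,
                      columns=[f'prod_{i}' for i in range(1, n + 1)])
print(result)
```

This avoids building string column names and regrouping them, which may help on large frames, though I have not benchmarked it.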

Pandas - Map - Dummy Variables - Assign value of 1

I have two dataframes, x.head() looks like this:
top mid adc support jungle
Irelia Ahri Jinx Janna RekSai
Gnar Ahri Caitlyn Leona Rengar
Renekton Fizz Sivir Annie Rengar
Irelia Leblanc Sivir Thresh JarvanIV
Gnar Lissandra Tristana Janna JarvanIV
and dataframe fullmatrix.head() that I have created looks like this:
Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra ... XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
2 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 0 0 0 ...
Now what I cannot figure out is how to assign a value of 1 for each name in the x dataframe to the respective column that has the same name in the fullmatrix dataframe row by row (both dataframes have the same number of rows).

I'm sure this can be improved but one advantage is that it only requires the first DataFrame, and it's conceptually nice to chain operations until you get the desired solution.
fullmatrix = (x.stack()
               .reset_index(name='names')
               .pivot(index='level_0', columns='names', values='names')
               .notna().astype(int)  # 1 where a name is present, 0 where the pivot left a gap
               .reset_index(drop=True))
Note that only the names that appear in your x DataFrame will appear as columns in fullmatrix. If you want the additional columns, you can simply perform a join.

Consider adding a key = 1 column and then iterating through each column for a list of pivoted dfs which you then horizontally merge with pd.concat. Finally run a DataFrame.update() to update original fullmatrix with values from pvt_df, aligned to indices.
x['key'] = 1
dfs = []
for col in x.columns[:-1]:
    # pivot on x's own index (the original snippet referred to an undefined df)
    dfs.append(x.pivot_table(index=x.index, columns=[col], values='key').fillna(0))
pvt_df = pd.concat(dfs, axis=1).astype(int)
fullmatrix.update(pvt_df)
fullmatrix = fullmatrix.astype(int)
fullmatrix # ONLY FOR VISIBLE COLUMNS IN ORIGINAL POST
# Irelia Gnar Renekton Kassadin Sion Jax Lulu Maokai Rumble Lissandra XinZhao Amumu Udyr Ivern Shaco Skarner FiddleSticks Aatrox Volibear MonkeyKing
# 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
# 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The OP is trying to create a table of dummy variables from a set of data points. Each data point contains 5 attributes, and there are N unique attributes in total.
We will use a simplified dataset to demonstrate how to do it:
5 unique attributes
3 data entries
each data entry contains 3 attributes.
x = pd.DataFrame([['a', 'b', 'c'],
['b', 'd', 'e'],
['e', 'b', 'a']])
fullmatrix = pd.DataFrame([[0 for _ in range(5)] for _ in range(3)],
columns=['a','b','c','d','e'])
""" fullmatrix:
a b c d e
0 0 0 0 0 0
1 0 0 0 0 0
2 0 0 0 0 0
"""
# each element of x_row_joined is a string of attributes delimited by ","
x_row_joined = pd.Series((",".join(row[1]) for row in x.iterrows()))
fullmatrix = x_row_joined.str.get_dummies(sep=',')
This method is inspired by offbyone's answer and uses pandas.Series.str.get_dummies. We first join each row of x with a chosen delimiter, then call Series.str.get_dummies with that same delimiter, which generates the dummy-variable table. (Caution: don't pick a sep that occurs in the values of x.)
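If the delimiter caveat is a concern, or if fullmatrix must keep columns for names that never occur in x, a possible variant is to stack x and reindex onto the full column list. This is only a sketch; all_columns is a hypothetical list standing in for the columns of the original fullmatrix:

```python
import pandas as pd

x = pd.DataFrame([['a', 'b', 'c'],
                  ['b', 'd', 'e'],
                  ['e', 'b', 'a']])

# Hypothetical full set of attribute names; 'f' never appears in x.
all_columns = ['a', 'b', 'c', 'd', 'e', 'f']

fullmatrix = (pd.get_dummies(x.stack())   # one indicator row per cell of x
                .groupby(level=0).max()   # collapse back to one row per original row
                .astype(int)
                .reindex(columns=all_columns, fill_value=0))
print(fullmatrix)
```

No string joining is involved here, so the values of x can contain any characters.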
