Combine Pandas' startwith and isin - python

I need to get dataframe (all columns) where specific isin value appears only in startwith columns.
Currently I have codes that do both steps seperate but need them to happen in one line:
Dataframe:
col1 col2 filter_col3 filter_col4
0 0 0 1 0
1 0 0 0 1
2 0 0 0 0
#get only columns that start with str
filter_col = [c for c in df if c.startwith('filter_')]
#return row when 1 appears in any of the columns
df[df.isin([1]).any(axis=1)]
Expected result:
col1 col2 filter_col3 filter_col4
0 0 0 1 0
1 0 0 0 1

Try with pd.DataFrame.filter:
# also `.eq(1)` instead of `.isin([1])`
df[df.filter(regex='^filter_').isin([1]).any(1)]
Output:
col1 col2 filter_col3 filter_col4
0 0 0 1 0
1 0 0 0 1

You can also use df.filter with Series.eq:
In [2567]: df[df.filter(like='filter').eq(1).any(1)]
Out[2567]:
col1 col2 filter_col3 filter_col4
0 0 0 1 0
1 0 0 0 1

Related

Create Combined Column From Three Columns With The Most Data (Not NA)

I have data like:
import pandas as pd
df = pd.DataFrame(data=[[1,-2,3,0,0], [0,0,0,4,0], [0,0,0,0,5]]).T
df.columns = ['col1', 'col2', 'col3']
> df
col1 col2 col3
1 0 0
-2 0 0
3 0 0
0 4 0
0 0 5
I want to create a fourth ("Col4") that takes the col that is non-zero.
So result would be:
col1 col2 col3 col4
1 0 0 1
-2 0 0 -2
3 0 0 3
0 4 0 4
0 0 5 5
EDIT: If two non-zero, always use col1. Also, the numbers may be negative. I have updated the df to reflect this.
Using the maximum of the columns is a possibility
df['col4'] = df.max(axis=1)
Here's an example:
def func(a):
a = set(a)
assert len(a)==2 # 0 and another number
for i in a:
if i!=0:
return i
df['col4'] = df.apply(func,axis=1)

removing redundant columns when using get_dummies

Hi have a pandas dataframe df containing categorical variables.
df=pandas.DataFrame(data=[['male','blue'],['female','brown'],
['male','black']],columns=['gender','eyes'])
df
Out[16]:
gender eyes
0 male blue
1 female brown
2 male black
using the function get_dummies I get the following dataframe
df_dummies = pandas.get_dummies(df)
df_dummies
Out[18]:
gender_female gender_male eyes_black eyes_blue eyes_brown
0 0 1 0 1 0
1 1 0 0 0 1
2 0 1 1 0 0
Owever the columns gender_female and gender_male contain the same information because the original column could assume a binary value. Is there a (smart) way to keep only one of the 2 columns?
UPDATED
The use of
df_dummies = pandas.get_dummies(df,drop_first=True)
Would give me
df_dummies
Out[21]:
gender_male eyes_blue eyes_brown
0 1 1 0
1 0 0 1
2 1 0 0
but I would like to remove the columns for which originally I had only 2 possibilities
The desired result should be
df_dummies
Out[18]:
gender_male eyes_black eyes_blue eyes_brown
0 1 0 1 0
1 0 0 0 1
2 1 1 0 0
Yes, you can use the argument dropfirst:
drop_first=True
From the documentation:
pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
b c
0 0 0
1 1 0
2 0 1
3 0 0
4 0 0
To have all dummy columns for eyes, and one for gender, use this:
df = pd.get_dummies(df, prefix=['eyes'], columns=['eyes'])
df = pd.get_dummies(df,drop_first=True)
Output:
eyes_black eyes_blue eyes_brown gender_male
0 0 1 0 1
1 0 0 1 0
2 1 0 0 1
More general:
gender eyes heigh
0 male blue tall
1 female brown short
2 male black average
for i in df.columns:
if len(df.groupby([i]).size()) > 2:
df = pd.get_dummies(df, prefix=[i], columns=[i])
df = pd.get_dummies(df, drop_first=True)
Output:
eyes_black eyes_blue eyes_brown heigh_average heigh_short heigh_tall \
0 0 1 0 0 0 1
1 0 0 1 0 1 0
2 1 0 0 1 0 0
gender_male
0 1
1 0
2 1
You could use itertools.combinations to find all pairs of columns, then any potentially redundant pair of columns will be one where for every row one column is True and the other is False - i.e. an XOR:
import pandas as pd
from itertools import combinations
df = pd.DataFrame(data=[['male','blue'],['female','brown'],['male','black']],
columns=['gender','eyes'])
dummies = pd.get_dummies(df)
for c1, c2 in combinations(dummies.columns, 2):
if all(dummies[c1] ^ dummies[c2]):
print(c1,c2)
However, this also notices that in your examples all females have brown eyes, hence we get the following printed:
gender_female gender_male
gender_male eyes_brown
Alternatively, you can split the dataframe into parts you want to apply drop_first=True and parts you don't. Then concatenate them together.
df1 = df.iloc[:, 0:2]
df2 = df.iloc[:, 2:]
df1 = pd.get_dummies(df1 ,drop_first=True)
df = pd.concat([df1, df2], axis=1)

Dummy variables when not all categories are present across multiple features & data sets

I want to ask an extension of this question, which talks about adding a label to missing classes to make sure the dummies are encoded as blanks correctly.
Is there a way to do this automatically across multiple sets of data and have the labels automatically synched between the two? (I.e. for Test & Training sets). I.e. the same columns but different classes of data represented in each?
E.g.:
Suppose I have the following two dataframes:
df1 = pd.DataFrame.from_items([('col1', list('abc')), ('col2', list('123'))])
df2 = pd.DataFrame.from_items([('col1', list('bcd')), ('col2', list('234'))])
df1
col1 col2
1 a 1
2 b 2
3 c 3
df2
col1 col2
1 b 2
2 c 3
3 d 4
I want to have:
df1
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1 1 0 0 0 1 0 0 0
2 0 1 0 0 0 1 0 0
3 0 0 1 0 0 0 1 0
df2
col1_a col1_b col1_c col1_d col2_1 col2_2 col2_3 col2_4
1 0 1 0 0 0 1 0 0
2 0 0 1 0 0 0 1 0
3 0 0 0 1 0 0 0 1
WITHOUT having to specify in advance that
col1_labels = ['a', 'b', 'c', 'd'], col2_labels = ['1', '2', '3', '4']
And can I do this systematically for many columns all at once? I'm imagining a fuction that when passed in two or more dataframes (assuming columns are the same for all):
reads which columns in the pandas dataframe are categories
figures out what that overall labels are
and then provides the category labels to each column
Does that seem right? Is there a better way?
I think you need reindex by union of all columns if same categorical columns names in both Dataframes:
print (df1)
df1
1 a
2 b
3 c
print (df2)
df1
1 b
2 c
3 d
df1 = pd.get_dummies(df1)
df2 = pd.get_dummies(df2)
union = df1.columns | df2.columns
df1 = df1.reindex(columns=union, fill_value=0)
df2 = df2.reindex(columns=union, fill_value=0)
print (df1)
df1_a df1_b df1_c df1_d
1 1 0 0 0
2 0 1 0 0
3 0 0 1 0
print (df2)
df1_a df1_b df1_c df1_d
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1

Creating a dataframe with binary valued columns with pandas using values from an existing dataframe

I am trying to create a new dataframe with binary (0 or 1) values from an exisitng dataframe. For every row in the given dataframe, the program should take value from each cell and set 1 for the corresponding columns of the row indexed with same number in the new dataframe
I have tried executing the following code snippet.
for col in products :
index = 0;
for item in products.loc[col] :
products_coded.ix[index, 'prod_' + str(item)] = 1;
index = index + 1;
It works for less number of rows. But,it takes lot of time for any large dataset. What could be the best way to get the desired outcome.
I think you need:
first get_dummies with casting values to strings
aggregate max by columns names max
for correct ordering convert columns to int
reindex for ordering and append missing columns, replace NaNs by 0 by parameter fill_value=0 and remove first 0 column
add_prefix for rename columns
df = pd.DataFrame({'B':[3,1,12,12,8],
'C':[0,6,0,14,0],
'D':[0,14,0,0,0]})
print (df)
B C D
0 3 0 0
1 1 6 14
2 12 0 0
3 12 14 0
4 8 0 0
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
.max(level=0, axis=1)
.rename(columns=lambda x: int(x))
.reindex(columns=range(1, df.values.max() + 1), fill_value=0)
.add_prefix('prod_'))
print (df1)
prod_1 prod_2 prod_3 prod_4 prod_5 prod_6 prod_7 prod_8 prod_9 \
0 0 0 1 0 0 0 0 0 0
1 1 0 0 0 0 1 0 0 0
2 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 0 0
4 0 0 0 0 0 0 0 1 0
prod_10 prod_11 prod_12 prod_13 prod_14
0 0 0 0 0 0
1 0 0 0 0 1
2 0 0 1 0 0
3 0 0 1 0 1
4 0 0 0 0 0
Another similar solution:
df1 = (pd.get_dummies(df.astype(str), prefix='', prefix_sep='')
.max(level=0, axis=1))
df1.columns = df1.columns.astype(int)
df1 = (df1.reindex(columns=range(1, df1.columns.max() + 1), fill_value=0)
.add_prefix('prod_'))

Quickest way to make a get_dummies type dataframe from a column with a multiple of strings

I have a column, 'col2', that has a list of strings. The current code I have is too slow, there's about 2000 unique strings (the letters in the example below), and 4000 rows. Ending up as 2000 columns and 4000 rows.
In [268]: df.head()
Out[268]:
col1 col2
0 6 A,B
1 15 C,G,A
2 25 B
Is there a fast way to make this in a get dummies format? Where each string has it's own column and in each string's column there is a 0 or 1 if it that row has that string in col2.
In [268]: def get_list(df):
d = []
for row in df.col2:
row_list = row.split(',')
for string in row_list:
if string not in d:
d.append(string)
return d
df_list = get_list(df)
def make_cols(df, lst):
for string in lst:
df[string] = 0
return df
df = make_cols(df, df_list)
for idx in range(0, len(df['col2'])):
row_list = df['col2'].iloc[idx].split(',')
for string in row_list:
df[string].iloc[idx]+= 1
Out[113]:
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0
This is my current code for it but it's too slow.
Thanks you any help!
You can use:
>>> df['col2'].str.get_dummies(sep=',')
A B C G
0 1 1 0 0
1 1 0 1 1
2 0 1 0 0
To join the Dataframes:
>>> pd.concat([df, df['col2'].str.get_dummies(sep=',')], axis=1)
col1 col2 A B C G
0 6 A,B 1 1 0 0
1 15 C,G,A 1 0 1 1
2 25 B 0 1 0 0

Categories