I am trying to set up a featurizer which drops all but the first 10 columns of my database. The database consists of 76 columns in total. The idea is to apply PolynomialFeatures(1) to the 10 columns I would like to keep, but I cannot see a smart way to eliminate the remaining 66 columns (I was thinking of something like PolynomialFeatures(0), the idea being to multiply them by the constant 0, but it does not seem to work). The issues are basically two: 1) how to tell DataFrameMapper to apply the same featurizer to a range of columns (namely A_11 to A_76); 2) how to tell DataFrameMapper to apply a featurizer that eliminates those columns.
The (incomplete) code I tried so far looks as follows. In the code I denoted the range from issue 1) as A_11 - A_76, and issue 2) as ?:
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper
from dml_iv.utilities import SubsetWrapper, ConstantModel
from econml.sklearn_extensions.linear_model import StatsModelsLinearRegression

col = ["A_" + str(k) for k in range(XW.shape[1])]
XW_db = pd.DataFrame(XW, columns=col)

subset_names = set(['A_0','A_1','A_2','A_3','A_4','A_5','A_6','A_7','A_8','A_9','A_10'])
# list of indices of features X to use in the final model
mapper = DataFrameMapper([
    ('A_0', PolynomialFeatures(1)),
    ('A_1', PolynomialFeatures(1)),
    ('A_2', PolynomialFeatures(1)),
    ('A_3', PolynomialFeatures(1)),
    ('A_4', PolynomialFeatures(1)),
    ('A_5', PolynomialFeatures(1)),
    ('A_11 - A_76', ?)])  ## PROBLEMATIC PART
Why don't you drop columns you don't want from your dataframe and map what's left?
cols_map = [...] # list of columns to map
cols_drop = [...] # list of columns to drop
XW_db = XW_db.drop(cols_drop, axis=1) # you're left with only what to map
mapper = DataFrameMapper(cols_map)
...
If the reason for not wanting to drop columns is that they will be used later, you can simply assign the result of your drop to other variables, thus creating several subset dataframes which are easier to manipulate:
df2 = df1.drop(cols_drop2,axis=1) # df2 is a subset of df1
df3 = df1.drop(cols_drop3,axis=1) # df3 is a subset of df1
# Alternative is to decide what to keep instead of what to drop
df4 = df1[cols_keep] # df4 is a subset of df1
# df1 remains the full dataframe
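Applied to the column layout in the question, a minimal sketch of this approach might look like the following (assuming PolynomialFeatures comes from sklearn.preprocessing, that XW_db is the 76-column frame built above, and that A_0 through A_10 are the columns to keep, as in subset_names):
from sklearn.preprocessing import PolynomialFeatures
from sklearn_pandas import DataFrameMapper

keep_cols = ["A_" + str(k) for k in range(11)]                # A_0 .. A_10
drop_cols = [c for c in XW_db.columns if c not in keep_cols]  # everything else

XW_small = XW_db.drop(drop_cols, axis=1)  # only the columns to map remain
# wrap each name in a list so the transformer receives a 2-D column
mapper = DataFrameMapper([([c], PolynomialFeatures(1)) for c in keep_cols])
features = mapper.fit_transform(XW_small)
If I remember correctly, DataFrameMapper also discards columns that are not listed in its feature definitions unless a default transformer is supplied, so the explicit drop is mainly there to make the intent obvious.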
Wondering what the best way to tackle this issue is. If I have a DataFrame df1 with the following columns
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate checking df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then try to drop any columns not in expected_cols into another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But then this seems problematic depending on column ordering, and also in cases where there are either more columns than expected or fewer. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference using -:
Assuming df1 has these columns:
In [542]: df1_cols = df1.columns # ['type_of_fruit', 'name_of_fruit', 'price']
In [539]: expected_cols = ['name_of_fruit','price']
In [541]: unwanted_cols = list(set(df1_cols) - set(expected_cols))
In [542]: df2 = df1[unwanted_cols]
In [543]: df1.drop(unwanted_cols, axis=1, inplace=True)
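If the expected columns also need to end up present and in a fixed order even when some are missing (the reindex case mentioned in the question), a small sketch along the same lines:
unwanted_cols = list(set(df1.columns) - set(expected_cols))
df2 = df1[unwanted_cols]                  # keep the "dropped" columns around
df1 = df1.reindex(columns=expected_cols)  # expected columns only, in order; missing ones become NaN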
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and you can store the results in a dict where the True key gets the DataFrame with the subset of columns in the list and the False key has the subset of columns not in the list.
Sample Data
import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
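For what it's worth, the same True/False split can also be sketched without the columns-axis groupby, using plain boolean column selection (a small alternative sketch, reusing the df and expected_cols defined above):
in_list = df.columns.isin(expected_cols)
d = {True: df.loc[:, in_list], False: df.loc[:, ~in_list]}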
This post mentions symmetric difference and leveraging code like df1.except(df2).union(df2.except(df1)) and/or df1.unionAll(df2).except(df1.intersect(df2)), but I'm getting syntax errors when using except.
I'm trying to compare two dataframes that can have 50 or more columns. I have working code below but need to avoid hard-coding the columns.
Sample code and example:
import sys
from pyspark.sql import Window
from pyspark.sql.functions import col, count, lit, when

# Create the two dataframes
df1 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
                                  (33,'Kom',3500,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
                                  (55,'Vom',5000,'mex','IT','2/11/2019'),(66,'XYZ',5000,'mex','IT','2/11/2019')],
                                 ['No','Name','Sal','Address','Dept','Join_Date'])
df2 = sqlContext.createDataFrame([(11,'Sam',1000,'ind','IT','2/11/2019'),(22,'Tom',2000,'usa','HR','2/11/2019'),
                                  (33,'Kom',3000,'uk','IT','2/11/2019'),(44,'Nom',4000,'can','HR','2/11/2019'),
                                  (55,'Xom',5000,'mex','IT','2/11/2019'),(77,'XYZ',5000,'mex','IT','2/11/2019')],
                                 ['No','Name','Sal','Address','Dept','Join_Date'])
df1 = df1.withColumn('FLAG',lit('DF1'))
df2 = df2.withColumn('FLAG',lit('DF2'))
# Concatenate the two DataFrames, to create one big dataframe.
df = df1.union(df2)
# Use a window function to check if the count of identical rows is more than 1; if it is,
# mark column FLAG as SAME, else keep it the way it is. Finally, drop the duplicates.
my_window = Window.partitionBy('No','Name','Sal','Address','Dept','Join_Date').rowsBetween(-sys.maxsize, sys.maxsize)
df = df.withColumn('FLAG', when((count('*').over(my_window) > 1),'SAME').otherwise(col('FLAG'))).dropDuplicates()
df.show()
You can get all column names from df and use that list as parameter for the Window function:
cols = df.columns
cols.remove('FLAG')
my_window = Window.partitionBy(cols).rowsBetween(-sys.maxsize, sys.maxsize)
The remaining code stays unchanged.
I have two dataframes. One has two important labels that have some associated columns for each label. The second one has the same labels and more useful data for those same labels. I'm trying to replace the values in the first with the values of the second for each appropriate label. For example:
df = pd.DataFrame({'a':['x','y','z','t'], 'b':['t','x','y','z'], 'a_1':[1,2,3,4], 'a_2':[4,2,4,1], 'b_1':[1,2,3,4], 'b_2':[4,2,4,2]})
df_2 = pd.DataFrame({'n':['x','y','z','t'], 'n_1':[1,2,3,4], 'n_2':[1,2,3,4]})
I want to replace the values in a_1/a_2 (and likewise b_1/b_2) with the n_1/n_2 values of the rows in df_2 whose n matches a (or b). So far I tried using the replace and map functions, and they work when I use them like this:
df.iloc[0] = df.iloc[0].replace({'a_1':df['a_1']}, df_2['n_1'].loc(df['a'].iloc[0])
I can make the substitution for one specific line, but if I try to put that in a for loop and change the numbers I get the error Cannot index by location index with a non-integer key. If I take the ilocs out I get the original df unchanged and without any error messages. I get the same behavior when I use the map function. The way I tried to do the for loop and the map:
for i in df:
df.iloc[i] = df.iloc[i].replace{'a_1':df['a_1']}, df_2['n_1'].loc(df['a'].iloc[i])
df.iloc[i] = df.iloc[i].replace{'b_1':df['b_1']}, df_2['n_1'].loc(df['b'].iloc[i])
And so on. And for the map function:
for i in df:
df = df.map(df['b_1']}: df_2['n_1'].loc(df['b'].iloc[i])
df = df.map(df['a_1']}: df_2['n_1'].loc(df['a'].iloc[i])
I would like the resulting dataframe to have the same format as the first but with the values of the second, something like this:
df = {'a':['x','y','z','t'], 'b':['t','x','y','z'], 'an_1':[1,2,3,4], 'an_2':[1,2,3,4], 'bn_1':[1,2,3,4], 'bn_2':[1,2,3,4]}
where an and bn are the values for a and b when n is equal to a or b in the second dataframe.
Hope this is comprehensible.
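If I read the intent correctly, one way to sketch this without row-wise loops is to turn df_2 into label-to-value lookups and map the label columns through them (a sketch of one interpretation only; it assumes df and df_2 are pandas DataFrames built from the data above):
lookup_1 = df_2.set_index('n')['n_1']  # label -> n_1 value
lookup_2 = df_2.set_index('n')['n_2']  # label -> n_2 value

# map the labels in 'a' and 'b' through the lookups
df['a_1'] = df['a'].map(lookup_1)
df['a_2'] = df['a'].map(lookup_2)
df['b_1'] = df['b'].map(lookup_1)
df['b_2'] = df['b'].map(lookup_2)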
I have a MultiIndex DataFrame and I'm trying to select data in it based on certain criteria, so far so good. The problem is that once I have selected my data using .loc and pd.IndexSlice, the resulting DataFrame, which should logically have fewer rows and fewer elements in the first level of the MultiIndex, keeps exactly the same MultiIndex, with some keys in it referring to empty dataframes.
I've tried creating a completely new DataFrame with a new index, but the structure of my data set is complicated and there is not always the same number of elements in a given level, so it is not easy to create a DataFrame with the right shape into which I can put the data.
import numpy as np
import pandas as pd

np.random.seed(3)  # so my example is reproducible
idx = pd.IndexSlice

iterables = [['A','B','C'], [0,1,2], ['some','rdm','data']]
my_index = pd.MultiIndex.from_product(iterables, names=['first','second','third'])
my_columns = ['col1','col2','col3']

df1 = pd.DataFrame(data=np.random.randint(10, size=(len(my_index), len(my_columns))),
                   index=my_index,
                   columns=my_columns)

# Ok, so let's say I want to keep only the elements in the first level of my index (["A","B","C"])
# for which the total sum in column 3 is less than 35, for some reason
boolean_mask = (df1.groupby(level="first").col3.sum() < 35).tolist()
first_level_to_keep = df1.index.levels[0][boolean_mask].tolist()

# let's select the wanted data and put it in df2
df2 = df1.loc[idx[first_level_to_keep, :, :], :]
So far, everything is as expected
The problem is when I want to access the df2 index. I expected the following:
df2.index.levels[0].tolist() == ['B','C']
to be true. But this is what gives a True statement:
df2.index.levels[0].tolist() == ['A','B','C']
So my question is the following: is there a way to select data and get in return a DataFrame with a MultiIndex that reflects what is actually in it? Because I find it weird to be able to select non-existing data in my df2.
Thank you for your time!
Even if you delete the rows corresponding to a particular value in an index level, that value still exists. You can reset the index and then set those columns back as an index in order to generate a MultiIndex with new level values.
df2 = df2.reset_index().set_index(['first','second','third'])
print(df2.index.levels[0].tolist() == ['B','C'])
True
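Another option worth noting, as a sketch (it assumes a pandas version where MultiIndex.remove_unused_levels is available, i.e. 0.20 or later): rebuild the levels from the values actually present instead of resetting the index.
# drop level values that no longer appear in the selected data
df2.index = df2.index.remove_unused_levels()
print(df2.index.levels[0].tolist() == ['B','C'])
True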
I am looking for a way to do some conditional mapping using multiple comparisons.
I have millions and millions of rows that I am investigating using sample SQL extracts in pandas. Along with the SQL extracts read into pandas DataFrames I also have some rules tables, each with a few columns (these are also loaded into dataframes).
This is what I want to do: where a row in my SQL extract matches the conditions expressed in any one row of my rules table, I would like to generate a 1, otherwise a 0. In the end I would like to add a column to my SQL extract, called RULE_RESULT, holding these 1's and 0's.
I have got a system that works using df.merge, but it produces many extra duplicate rows in the process that must then be removed afterwards. I am looking for a better, faster, more elegant solution and would be grateful for any suggestions.
Here is a working example of the problem and the current solution code:
import pandas as pd
import numpy as np
#Create a set of test data
test_df = pd.DataFrame()
test_df['A'] = [1,120,982,1568,29,455,None, None, None]
test_df['B'] = ['EU','US',None, 'GB','DE','EU','US', 'GB',None]
test_df['C'] = [1111,1121,1111,1821,1131,1111,1121,1821,1723]
test_df['C_C'] = test_df['C']
test_df
#Create a rules_table
rules_df = pd.DataFrame()
rules_df['A_MIN'] = [0,500,None,600,200]
rules_df['A_MAX'] = [10,1000,500,1200,800]
rules_df['B_B'] = ['GB','GB','US','EU','EU']
rules_df['C_C'] = [1111,1821,1111,1111,None]
rules_df
def map_rules_to_df(df, rules):
    # create a column that mimics the index, to keep track of later duplication
    df['IND'] = df.index
    # merge the rules with the data on C values
    df = df.merge(rules, left_on='C_C', right_on='C_C', how='left')
    # create a rule_result column with a default value of zero
    df['RULE_RESULT'] = 0
    # create a mask identifying those test_df rows that fit with a rule_df row
    mask = df[
        ((df['A'] > df['A_MIN']) | (df['A_MIN'].isnull())) &
        ((df['A'] < df['A_MAX']) | (df['A_MAX'].isnull())) &
        ((df['B'] == df['B_B']) | (df['B_B'].isnull())) &
        ((df['C'] == df['C_C']) | (df['C_C'].isnull()))
    ]
    # use mask.index to replace 0's in the result column with a 1
    df.loc[mask.index.tolist(), 'RULE_RESULT'] = 1
    # drop the redundant rule_df columns
    df = df.drop(['B_B', 'C_C', 'A_MIN', 'A_MAX'], axis=1)
    # drop duplicate rows
    df = df.drop_duplicates(keep='first')
    # drop rows where the original index is duplicated and the rule result is zero
    df = df[(df['IND'].duplicated(keep=False)) & (df['RULE_RESULT'] == 0) == False]
    # reset the df index with the original index
    df.index = df['IND'].values
    # drop the now redundant second index column (IND)
    df = df.drop('IND', axis=1)
    print('df shape', df.shape)
    return df
#map the rules
result_df = map_rules_to_df(test_df,rules_df)
result_df
I hope I have made what I would like to do clear and thank you for your help.
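One possible direction, sketched below, that avoids the merge and the intermediate duplicate rows: evaluate each rule row as a vectorized boolean mask over the extract and OR the masks together. This is only an alternative sketch using the column names and the pandas import from the example above, not the question's own solution:
def map_rules_no_merge(df, rules):
    # start with every row failing all rules
    result = pd.Series(False, index=df.index)
    # build one vectorized mask per rule row; a null rule field means "no constraint"
    for _, rule in rules.iterrows():
        mask = pd.Series(True, index=df.index)
        if pd.notnull(rule['A_MIN']):
            mask &= df['A'] > rule['A_MIN']
        if pd.notnull(rule['A_MAX']):
            mask &= df['A'] < rule['A_MAX']
        if pd.notnull(rule['B_B']):
            mask &= df['B'] == rule['B_B']
        if pd.notnull(rule['C_C']):
            mask &= df['C'] == rule['C_C']
        result |= mask
    out = df.copy()
    out['RULE_RESULT'] = result.astype(int)
    return out

result_df = map_rules_no_merge(test_df, rules_df)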