After running scikit-learn's VarianceThreshold on a set of data, a couple of features get removed. I feel I'm doing something simple yet stupid, but I'd like to retain the names of the remaining features. The following code:
def VarianceThreshold_selector(data):
    selector = VarianceThreshold(.5)
    selector.fit(data)
    selector = pd.DataFrame(selector.transform(data))
    return selector
x = VarianceThreshold_selector(data)
print(x)
changes the following data (this is just a small subset of the rows):
Survived Pclass Sex Age SibSp Parch Nonsense
0 3 1 22 1 0 0
1 1 2 38 1 0 0
1 3 2 26 0 0 0
into this (again just a small subset of the rows)
0 1 2 3
0 3 22.0 1 0
1 1 38.0 1 0
2 3 26.0 0 0
Using the get_support method, I know that these are Pclass, Age, SibSp, and Parch, so I'd rather this return something more like:
Pclass Age SibSp Parch
0 3 22.0 1 0
1 1 38.0 1 0
2 3 26.0 0 0
Is there an easy way to do this? I'm very new to scikit-learn, so I'm probably just doing something silly.
Would something like this help? If you pass it a pandas DataFrame, it gets the columns and uses get_support, as you mentioned, to index into the column list and pull out only the column headers that met the variance threshold.
>>> df
Survived Pclass Sex Age SibSp Parch Nonsense
0 0 3 1 22 1 0 0
1 1 1 2 38 1 0 0
2 1 3 2 26 0 0 0
>>> from sklearn.feature_selection import VarianceThreshold
>>> def variance_threshold_selector(data, threshold=0.5):
...     selector = VarianceThreshold(threshold)
...     selector.fit(data)
...     return data[data.columns[selector.get_support(indices=True)]]
>>> variance_threshold_selector(df, 0.5)
Pclass Age
0 3 22
1 1 38
2 3 26
>>> variance_threshold_selector(df, 0.9)
Age
0 22
1 38
2 26
>>> variance_threshold_selector(df, 0.1)
Survived Pclass Sex Age SibSp
0 0 3 1 22 1
1 1 1 2 38 1
2 1 3 2 26 0
I came here looking for a way to get transform() or fit_transform() to return a data frame, but I suspect it's not supported.
However, you can subset the data a bit more cleanly like this:
data_transformed = data.loc[:, selector.get_support()]
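For context, a minimal end-to-end sketch of that approach (assuming data is a numeric DataFrame with no NaNs):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.5)
selector.fit(data)
# the boolean mask from get_support() keeps the original column names
data_transformed = data.loc[:, selector.get_support()]

For what it's worth, newer scikit-learn releases (1.2+) can also return DataFrames directly from transform() via selector.set_output(transform='pandas'), if that version is available to you.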
There are probably better ways to do this, but for those interested, here's how I did it:
def VarianceThreshold_selector(data):
    # Select the model
    selector = VarianceThreshold(0)  # Defaults to 0.0, i.e. only remove features with the same value in all samples

    # Fit the model
    selector.fit(data)
    features = selector.get_support(indices=True)  # array of integer indices of the non-removed features
    features = [data.columns[i] for i in features]  # list of the non-removed features' names

    # Format and return
    selector = pd.DataFrame(selector.transform(data))
    selector.columns = features
    return selector
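A short usage sketch with a toy frame mirroring the question's subset (values copied from the question, threshold left at the default 0):

import pandas as pd
from sklearn.feature_selection import VarianceThreshold

data = pd.DataFrame({'Survived': [0, 1, 1], 'Pclass': [3, 1, 3], 'Sex': [1, 2, 2],
                     'Age': [22, 38, 26], 'SibSp': [1, 1, 0], 'Parch': [0, 0, 0],
                     'Nonsense': [0, 0, 0]})
print(VarianceThreshold_selector(data))  # the constant columns Parch and Nonsense are dropped, names are kept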
As I had some problems with the function by Jarad, I combined it with the solution by pteehan, which I found more reliable. I also added NA replacement by default, as VarianceThreshold cannot handle NA values.
def variance_threshold_select(df, thresh=0.0, na_replacement=-999):
    df1 = df.copy(deep=True)  # Make a deep copy of the dataframe
    selector = VarianceThreshold(thresh)
    selector.fit(df1.fillna(na_replacement))  # Fill NA values, as VarianceThreshold cannot deal with them
    df2 = df.loc[:, selector.get_support(indices=False)]  # Keep only the columns that passed the variance threshold
    return df2
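A quick usage sketch with toy data (not from the question) to show the fill-then-select behaviour:

import numpy as np
import pandas as pd

toy = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [5, 5, 5]})
print(variance_threshold_select(toy))  # 'b' is constant (zero variance) and gets dropped; 'a' keeps its NaN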
How about this as an approach?

import statistics

low_var_cols = []
for col in df.columns:
    # note: statistics.variance computes the sample variance (ddof=1)
    if statistics.variance(df[col]) <= 0.1:
        low_var_cols.append(col)

Then drop those columns from the dataframe, for example as sketched below.
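The drop step could be as simple as this (reusing the low_var_cols list built above):

df_reduced = df.drop(columns=low_var_cols)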
You can use Pandas for thresholding too
data_new = data.loc[:, data.std(axis=0) > 0.75]
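Note that .std() thresholds the standard deviation; to mirror VarianceThreshold's variance-based cutoff you could threshold the variance instead (ddof=0 matches scikit-learn's population variance):

data_new = data.loc[:, data.var(axis=0, ddof=0) > 0.75]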
Related
I have a dataframe with 9 columns, two of which are gender and smoker status. Every row in the dataframe is a person, and each column is their entry on a particular trait.
I want to count the number of entries that satisfy the condition of being both a smoker and male.
I have tried using a sum function:
maleSmoke = sum(1 for i in data['gender'] if i is 'm' and i in data['smoker'] if i is 1 )
but this always returns 0. The method works when I only check one criterion, however, and I can't figure out how to extend it to a second.
I also tried writing a function that counted its way through every entry into the dataframe but this also returns 0 for all entries.
def countSmokeGender(df):
    maleSmoke = 0
    femaleSmoke = 0
    maleNoSmoke = 0
    femaleNoSmoke = 0
    for i in range(20000):
        if df['gender'][i] is 'm' and df['smoker'][i] is 1:
            maleSmoke = maleSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 1:
            femaleSmoke = femaleSmoke + 1
        if df['gender'][i] is 'm' and df['smoker'][i] is 0:
            maleNoSmoke = maleNoSmoke + 1
        if df['gender'][i] is 'f' and df['smoker'][i] is 0:
            femaleNoSmoke = femaleNoSmoke + 1
    return maleSmoke, femaleSmoke, maleNoSmoke, femaleNoSmoke
I've tried pulling out the data sets as numpy arrays and counting those but that wasn't working either.
Are you using pandas?
Assuming you are, you can simply do this (note that your original code compares with is, which tests object identity rather than equality; use == for comparisons):
# How many male smokers
len(df[(df['gender']=='m') & (df['smoker']==1)])
# How many female smokers
len(df[(df['gender']=='f') & (df['smoker']==1)])
# How many male non-smokers
len(df[(df['gender']=='m') & (df['smoker']==0)])
# How many female non-smokers
len(df[(df['gender']=='f') & (df['smoker']==0)])
Or, you can use groupby:
df.groupby(['gender'])['smoker'].sum()
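To get all four counts at once, a groupby over both columns also works; a sketch assuming the same column names:

counts = df.groupby(['gender', 'smoker']).size()
# or reshaped into a gender-by-smoker table
counts.unstack(fill_value=0)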
Another alternative, which is great for data exploration: .pivot_table
With a DataFrame like this
id gender smoker other_trait
0 0 m 0 0
1 1 f 1 1
2 2 m 1 0
3 3 m 1 1
4 4 f 1 0
.. .. ... ... ...
95 95 f 0 0
96 96 f 1 1
97 97 f 0 1
98 98 m 0 0
99 99 f 1 0
you could do
result = df.pivot_table(
index="smoker", columns="gender", values="id", aggfunc="count"
)
to get a result like
gender f m
smoker
0 32 16
1 27 25
If you want to display the partial counts you can add the margins=True option and get
gender f m All
smoker
0 32 16 48
1 27 25 52
All 59 41 100
If you don't have a column to count over (you can't use smoker and gender because they are used for the labels) you could add a dummy column:
result = df.assign(dummy=1).pivot_table(
index="smoker", columns="gender", values="dummy", aggfunc="count",
margins=True
)
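Alternatively, pd.crosstab counts the row/column combinations directly and does not need a values or dummy column; a sketch under the same column names:

result = pd.crosstab(df["smoker"], df["gender"], margins=True)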
I would like to extract all distinct values from specific columns, create new columns for them, and count their frequency in every row.
My Input Dataframe is:
import pandas as pd
data = {'user_id': ['abc','def','ghi'],
'alpha': ['A','B,C,D,A','B,C,A'],
'beta': ['1|20|30','350','376']}
df = pd.DataFrame(data = data, columns = ['user_id','alpha','beta'])
print(df)
Looks like this,
user_id alpha beta
0 abc A 1|20|30
1 def B,C,D,A 350
2 ghi B,C,A 376
I want something like this,
user_id alpha beta A B C D 1 20 30 350 376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
My original data contains 11K rows, and the distinct values across alpha & beta number around 550.
I created a list from all the values in alpha & beta columns and applied pd.get_dummies but it results in a lot of rows. I would like all the rows to be rolled up based on user_id.
A similar idea is used by CountVectorizer on documents, where it creates columns based on all the words in the sentences and counts the frequency of each word. However, I am guessing pandas has a better and more efficient way to do this.
Grateful for all your assistance. :)
You can use Series.str.get_dummies to create a dummy indicator dataframe for each of the columns alpha and beta then using pd.concat concat these dataframes along axis=1:
cs = (('alpha', ','), ('beta', '|'))
df1 = pd.concat([df] + [df[c].str.get_dummies(sep=s) for c, s in cs], axis=1)
Result:
print(df1)
user_id alpha beta A B C D 1 20 30 350 376
0 abc A 1|20|30 1 0 0 0 1 1 1 0 0
1 def B,C,D,A 350 1 1 1 1 0 0 0 1 0
2 ghi B,C,A 376 1 1 1 0 0 0 0 0 1
I have a df with raw survey data, similar to the following, with 12000 rows and forty questions. All responses are categorical.
import pandas as pd
df = pd.DataFrame({'Age': ['20-30', '20-30', '30-45', '20-30', '30-45', '20-30'],
                   'Gender': ['M', 'F', 'F', 'F', 'M', 'F'],
                   'Income': ['20-30k', '30-40k', '40k+', '40k+', '40k+', '20-30k'],
                   'Question1': ['Good', 'Bad', 'OK', 'OK', 'Bad', 'Bad'],
                   'Question2': ['Happy', 'Unhappy', 'Very_Unhappy', 'Very_Unhappy', 'Very_Unhappy', 'Happy']})
I want to categorize the responses to each question according to Age, Gender and Income, to produce a frequency (by %) table for each question, laid out like the screenshot in the original post.
Crosstab produces too many categories, i.e. it breaks down by income and, within income, by age, etc., so I'm not sure how best to go about this. I'm sure this is an easy problem, but I'm new to Python, so any help would be appreciated.
As you said, using crosstab for all the columns breaks down the result by each column. You can use individual crosstabs and then concat:
pd.concat([pd.crosstab(df.Question1, df.Gender), pd.crosstab(df.Question1, df.Income), pd.crosstab(df.Question1, df.Age)], axis = 1)
F M 20-30k 30-40k 40k+ 20-30 30-45
Question1
Bad 2 1 1 1 1 2 1
Good 0 1 1 0 0 1 0
OK 2 0 0 0 2 1 1
Edit: to get an additional level in the columns
age = pd.crosstab(df.Question1, df.Age)
age.columns = pd.MultiIndex.from_product([['Age'], age.columns])
gender = pd.crosstab(df.Question1, df.Gender)
gender.columns = pd.MultiIndex.from_product([['Gender'], gender.columns])
income = pd.crosstab(df.Question1, df.Income)
income.columns = pd.MultiIndex.from_product([['Income'], income.columns])
pd.concat([age, gender, income], axis = 1)
Age Gender Income
20-30 30-45 F M 20-30k 30-40k 40k+
Question1
Bad 2 1 2 1 1 1 1
Good 1 0 0 1 1 0 0
OK 1 1 2 0 0 0 2
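A more compact way to get the same two-level column layout is to pass a dict to pd.concat, which uses the keys as the outer level; sketched with the same columns:

pd.concat(
    {col: pd.crosstab(df.Question1, df[col]) for col in ['Age', 'Gender', 'Income']},
    axis=1
)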
You can do a melt before crosstab:
s = (df.drop('Question2', axis=1)
       .melt(['Age', 'Gender', 'Income'])
       .drop('variable', axis=1)
       .rename(columns={'value': 'v1'})
       .melt('v1'))
pd.crosstab(s.v1, [s.variable, s.value])
Out[235]:
variable Age Gender Income
value 20-30 30-45 F M 20-30k 30-40k 40k+
v1
Bad 2 1 2 1 1 1 1
Good 1 0 0 1 1 0 0
OK 1 1 2 0 0 0 2
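Since the question asks for frequencies by %, crosstab's normalize argument can turn the counts into proportions, e.g. normalize='columns' for the share within each demographic group (building on the s frame above):

pd.crosstab(s.v1, [s.variable, s.value], normalize='columns')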
It was hard for me to come up with a clear title, but an example should make things clearer.
Index C1
1 [dinner]
2 [brunch, food]
3 [dinner, fancy]
Now, I'd like to create a set of binary features for each of the unique values in this column.
The example above would turn into:
Index C1 dinner brunch fancy food
1 [dinner] 1 0 0 0
2 [brunch, food] 0 1 0 1
3 [dinner, fancy] 1 0 1 0
Any help would be much appreciated.
For a performant solution, I recommend creating a new DataFrame by listifying your column.
pd.get_dummies(pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
brunch dinner fancy food
0 0 1 0 0
1 1 0 0 1
2 0 1 1 0
This is going to be so much faster than apply(pd.Series).
This works assuming lists don't contain repeats of the same value (e.g., ['dinner', ..., 'dinner']). If they do, then you'll need an extra groupby step:
(pd.get_dummies(
    pd.DataFrame(df.C1.tolist()), prefix='', prefix_sep='')
    .groupby(level=0, axis=1)
    .sum())
Well, if your data is like this, then what you're looking for isn't "binary" anymore.
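If you also want the original C1 column alongside the indicators, as in the desired output, a join can follow; a sketch that builds the dummy frame on df's own index (and assumes no repeated values per list, otherwise apply the groupby step above first):

dummies = pd.get_dummies(pd.DataFrame(df.C1.tolist(), index=df.index), prefix='', prefix_sep='')
out = df.join(dummies)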
Maybe use MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(df.C1), columns=mlb.classes_, index=df.Index).reset_index()
Out[970]:
Index brunch dinner fancy food
0 1 0 1 0 0
1 2 1 0 0 1
2 3 0 1 1 0
I have a train dataset which has 12 columns.
I want to select rows of the Cabin column according to the Pclass column's value being 1,
and then replace the value of those selected rows of the Cabin column with 1.
I tried the following code, but it replaces all values of the Cabin column with 1; even NaN values are replaced by 1. How can I replace only the selected rows?
train['Cabin'] =train[train['Pclass']==1]['Cabin']=1
You can select the rows of the Cabin column with a condition via loc and set them to a scalar:
train.loc[train['Pclass'] == 1, 'Cabin'] = 1
And your code replaces all values with 1 because it is the same as:
train['Cabin'] = 1
Sample:
train = pd.DataFrame({'Pclass': [1, 2, 3, 1, 2],
                      'Cabin': [10, 20, 30, 40, 50]})
print (train)
Cabin Pclass
0 10 1
1 20 2
2 30 3
3 40 1
4 50 2
train.loc[train['Pclass'] == 1, 'Cabin'] = 1
print (train)
Cabin Pclass
0 1 1
1 20 2
2 30 3
3 1 1
4 50 2
You can directly filter the rows you want to change and assign the value, instead of filtering, replacing and then assigning back to the dataframe.
So
train['Cabin'] =train[train['Pclass']==1]['Cabin']=1
becomes
train['Cabin'][train['Pclass']==1] = 1
Note that this chained indexing can trigger a SettingWithCopyWarning and, with copy-on-write enabled, may not update the original frame, so the .loc approach above is generally preferred.