I have a feature called smoking_status. It has 3 different values:
1) smokes
2) formerly smoked
3) never smoked
The feature column (smoking_status) has the above 3 values as well as a lot of NaN values. How can I treat the NaN values? My data is not numerical; if it were numerical I could have replaced them using the median or mean. How can I replace NaN values in my case?
There might be two better options than replacing NaN with 'unknown', at least in the context of a data science challenge, which I think this is:
Replace the NaN values with the most common value (the mode).
Predict the missing values using the data you have (a rough sketch follows the example output below).
Getting the most common value is easy. For this purpose you can use <column>.value_counts() to get the frequencies, followed by .idxmax(), which gives you the index element of value_counts() with the highest frequency. After that you just call fillna():
import pandas as pd
import numpy as np
df = pd.DataFrame(['formerly', 'never', 'never', 'never',
                   np.nan, 'formerly', 'never', 'never',
                   np.nan, 'never', 'never'], columns=['smoked'])
print(df)
print('--')
print(df.smoked.fillna(df.smoked.value_counts().idxmax()))
Gives:
smoked
0 formerly
1 never
2 never
3 never
4 NaN
5 formerly
6 never
7 never
8 NaN
9 never
10 never
--
0 formerly
1 never
2 never
3 never
4 never
5 formerly
6 never
7 never
8 never
9 never
10 never
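For the second option (predicting the missing values), here is a rough sketch, assuming the surrounding columns are numeric features; the age and bmi names below are made up for illustration:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy frame: two numeric features plus the partially missing target
df = pd.DataFrame({
    'age':            [63, 25, 30, 70, 45, 22, 58, 33],
    'bmi':            [31, 22, 24, 29, 27, 21, 30, 23],
    'smoking_status': ['smokes', 'never smoked', 'never smoked', 'formerly smoked',
                       np.nan, 'never smoked', np.nan, 'never smoked'],
})

known = df['smoking_status'].notna()

# Train on the rows where the status is known...
clf = RandomForestClassifier(random_state=0)
clf.fit(df.loc[known, ['age', 'bmi']], df.loc[known, 'smoking_status'])

# ...and fill the NaNs with the model's predictions
df.loc[~known, 'smoking_status'] = clf.predict(df.loc[~known, ['age', 'bmi']])
print(df)
Whether this actually beats the simple mode fill is something to check with cross-validation.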
You don't have the data for those rows. You could simply fill it with the median, mean, or most common value of that feature, but in this particular case that's a bad idea considering the nature of the feature.
A better approach would be to fill with a string such as 'unknown' or 'NA':
df['smoking_status'].fillna('NA')
Then you can label encode it or convert the column to a one-hot encoding (see the sketch after the categorical example below).
Example categorical data:
ser = pd.Categorical(['non', 'non', 'never', 'former', 'never', np.nan])
Fill it:
ser.add_categories(['unknown']).fillna('unknown')
Gives you:
[non, non, never, former, never, unknown]
Categories (4, object): [former, never, non, unknown]
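As a sketch of the encoding step mentioned above (the column name and values are made up to match the question):
import numpy as np
import pandas as pd

s = pd.Series(['smokes', 'never smoked', np.nan, 'formerly smoked', np.nan],
              name='smoking_status')
s = s.fillna('NA')

# One-hot encoding: one indicator column per category, including 'NA'
print(pd.get_dummies(s))

# Label encoding: each category gets an integer code
print(s.astype('category').cat.codes)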
It looks like the question is about methodology, not a technical issue.
So you can try:
1) The most frequent value among those three;
2) Statistics from other categorical fields of your dataset (e.g. the most common smoking status per group; see the sketch after this answer);
3) Random values;
4) An "UNKNOWN" category.
Then you can do one-hot encoding and definitely check your models with cross-validation to choose the proper approach.
There is also a more involved way: use this status as a target variable and try to predict the NaNs with scikit-learn using all the other data.
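For option 2, a rough sketch of a group-wise fill; the gender column below is a made-up example of another categorical field to group on:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'gender':         ['M', 'M', 'F', 'F', 'M', 'F'],
    'smoking_status': ['smokes', np.nan, 'never smoked', 'never smoked', 'smokes', np.nan],
})

# Most common smoking status within each gender group, broadcast back to the rows
fill = df.groupby('gender')['smoking_status'].transform(lambda s: s.mode().iloc[0])
df['smoking_status'] = df['smoking_status'].fillna(fill)
print(df)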
I am working on an SMS dataset that has two columns: a "label" column, which consists of "ham"/"spam", and a "messages" column, which consists of a bunch of strings.
I converted the "label" column to numeric labels, ham=1 and spam=0:
#Converting our labels to numeric labels
# ham = 1 and spam = 0
dfcat = dataset['label']=dataset.label.map({'ham':1,'spam':0})
dfcat.head()
When I ran the above code the first time it gave me exactly what I was looking for, but after I ran it again it started giving me NaN:
Out[108]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
Name: label, dtype: float64
Please, I need a way to fix this.
@G. Anderson gave the reason why you are seeing those NaN values the second time you run it.
As for a way to handle categorical variables in Python, one could use one hot encoding. Toy example below:
import pandas as pd
df = pd.DataFrame({"col1": ["a", "b", "c"], "label": ["ham", "spam", "ham"]})
df_ohe = pd.get_dummies(df, prefix="ohe", drop_first=True, columns=["label"])
df_ohe
However, it also depends on the number of categorical variables and their cardinality (if high, one-hot encoding might not be the best approach).
The behavior of Series.map() is to replace the values found in the provided dictionary and change all other values to NaN. If you want to run the same line of code multiple times, then all values need to be accounted for. You can either use a defaultdict (or another dict subclass with a default value; a sketch of this idea follows the fix below), or just include the results of the first run as inputs in case you run it a second time.
Change
dfcat = dataset['label']=dataset.label.map({'ham':1,'spam':0})
to
dfcat = dataset['label']=dataset.label.map({'ham':1,'spam':0,1:1,0:0})
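Alternatively, a small sketch of the dict-subclass idea: Series.map falls back to __missing__ for keys it does not find, so a class like the (made-up) KeepMissing below leaves already-converted values untouched and the line can be rerun safely:
import pandas as pd

class KeepMissing(dict):
    # Series.map uses __missing__ for absent keys instead of returning NaN
    def __missing__(self, key):
        return key  # leave already-converted values as they are

dataset = pd.DataFrame({'label': ['ham', 'spam', 'ham']})
mapping = KeepMissing({'ham': 1, 'spam': 0})

dataset['label'] = dataset['label'].map(mapping)
dataset['label'] = dataset['label'].map(mapping)  # second run is now a no-op
print(dataset['label'])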
I am cleaning a CSV file in Jupyter to do machine learning.
However, several columns have string values, like the column "description":
I know I need to use NLP to clean it, but I could not find how to do it in Jupyter.
Could you advise me on how to convert these values to numeric values?
Thank you
Numerical values are better for creating learning models than words or images (why? dimensionality reduction).
Common machine learning algorithms expect a numerical input.
The technique used to convert a word to a corresponding numerical value is called word embedding.
In word embedding, strings are converted to feature vectors(numbers).
Bag of words, word2vec, or GloVe can be used to implement this; a bag-of-words sketch follows below.
It is generally advisable to ignore fields that wouldn't be significant for the model, so include the description only if it is absolutely essential.
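As a minimal bag-of-words sketch with scikit-learn's CountVectorizer (the example descriptions are made up):
from sklearn.feature_extraction.text import CountVectorizer

descriptions = [
    "spacious apartment close to the city center",
    "quiet house with a large garden",
    "apartment with garden near the center",
]

# One row per description, one column per word, cell = word count
vec = CountVectorizer()
X = vec.fit_transform(descriptions)

print(vec.get_feature_names_out())
print(X.toarray())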
The problem you are describing is that of converting categorical data, usually in the form of strings or numerical IDs, to purely numerical data. I'm sure you are aware that using numerical IDs has a problem: it leads to the false interpretation that the data has some sort of order, like apple < orange < lime, when this is not the case.
It is common to use one-hot encoding to produce numerical indicator variables. After encoding one column, you have N columns, where N is the number of unique labels. The columns have a value of 1 when the corresponding categorical variable had that value and 0 otherwise. This is especially handy if there are few unique labels in one column. Both Pandas and sklearn have these sorts of functions available, albeit they are not as feature complete as one would hope.
The "description" column you have seems to be a bit trickier, because it actually includes language, not just categorical data. So that column would need to be parsed or handled in some other way. Although, the one-hot encoding scheme may very well be used for all the words in the description, producing a vector that has more 1's.
For example:
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(['a', 'b', 'c', 'a', 'a', np.nan])
>>> pd.get_dummies(df)
0_a 0_b 0_c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
5 0 0 0
Additional processing would be needed to get the encoding word by word. This approach considers only the full values as variables.
I've been trying without success to find a way to create an "average_gain_up" in Python and have gotten a bit stuck. Being new to groupby, there is something about how it treats functions that I've not managed to grasp, so any intuition on how to think through these types of problems would be helpful.
Problem:
Create a rolling 14-day sum, only summing if the value is > 0.
import pandas as pd

new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']])
new = new.T  # transposing into a friendly groupby format

# Group by a or b, filter to keep only positive values and then do a rolling sum;
# we keep NAs to ensure the sum runs over 14 values.
groupby = new.groupby(1)[0].filter(lambda x: x > 0, dropna=False).rolling(14).sum()
(The intended sum frame and the x.all()/len(x) results were shown as images in the original post.)
The code above throws a TypeError: "the filter must return a boolean result".
From reading other answers, I understand this is because I'm asking whether a whole series/frame is greater than 0.
The above code works with len(x), which again makes sense in that context.
I tried with .all() as well, but it doesn't behave as intended: .all() returns a single boolean per group and the sum is then just a simple rolling sum.
I've also tried creating a list of booleans to say which values are positive and which are not, but that also yields an error; this time I'm not sure why.
groupby1 = new.groupby(1)[0]
groupby2 = [y > 0 for x in groupby1 for y in x[1]]
groupby_try = new.groupby(1)[0].filter(lambda x: groupby2, dropna=False).rolling(2).sum()
1) How do I make the above code work, and what is wrong with how I am thinking about it?
2) Is this the "best practice" way to do these types of operations?
Any help is appreciated; let me know if I've missed anything or if further clarification is needed.
According to the docs on filter after a groupby, it is not meant to filter values within a group but rather to drop whole groups that don't meet some criterion (in the first example in the docs, a group is kept only if the sum of all its elements is above 2).
One way could be to first replace all the negative values in new[0] with 0, using np.clip for example, and then groupby, rolling and sum (a window of 2 is used here so the toy data gives non-NaN results; use 14 on the real data):
print(np.clip(new[0], 0, np.inf).groupby(new[1]).rolling(2).sum())
1
a 0 NaN
1 1.0
2 3.0
b 3 NaN
4 4.0
5 9.0
Name: 0, dtype: float64
This way avoids modifying the data in new; if you don't mind modifying it, you can change column 0 with new[0] = np.clip(new[0], 0, np.inf) and then do new.groupby(1)[0].rolling(2).sum(), which gives the same result.
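A minimal runnable sketch of that in-place variant (the astype(int) cast is added here only because the transpose leaves the value column with object dtype):
import pandas as pd

# Rebuild the example frame from the question
new = pd.DataFrame([[1, -2, 3, -2, 4, 5], ['a', 'a', 'a', 'b', 'b', 'b']]).T

# Clip negatives to zero in place, then a plain grouped rolling sum is enough
new[0] = new[0].astype(int).clip(lower=0)
print(new.groupby(1)[0].rolling(2).sum())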
I have a pandas DataFrame with a column 'Country'; its first rows are below:
0 tmp
1 Environmental Indicators: Energy
2 tmp
3 Energy Supply and Renewable Electricity Produc...
4 NaN
5 NaN
6 NaN
7 Choose a country from the following drop-down ...
8 NaN
9 Country
When I use this line:
energy['Country'] = energy['Country'].str.replace(r'[...]', 'a')
There is no change.
But when I use this line instead:
energy['Country'] = energy['Country'].str.replace(r'[...]', np.nan)
All values are NaN.
Why does only the second line change the output? My goal is to change only the values that contain the triple dots.
Is this what you want when you say "I need to change the whole values, not just the triple dots"?
mask = df.Country.str.contains(r'\.\.\.', na=False)
df.loc[mask, 'Country'] = 'a'  # .loc avoids chained-assignment pitfalls
.replace(r'[...]', 'a') treats the first parameter as a regular expression, but you want to treat it literally. So, you need .replace(r'\.\.\.', 'a').
As for your actual question, .str.replace requires a string as the second parameter. It attempts to convert np.nan to a string (which is not possible) and fails. For a reason not known to me, instead of raising a TypeError, it returns np.nan for each row.
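A small sketch of the escaped replacement on a toy Series (the regex flag is passed explicitly, since its default for str.replace has changed across pandas versions):
import pandas as pd

ser = pd.Series(['Energy Supply and Renewable Electricity Produc...',
                 'Choose a country from the following drop-down ...',
                 'Country'])

# Escape the dots (or pass regex=False) so only a literal '...' is matched
print(ser.str.replace(r'\.\.\.', 'a', regex=True))
print(ser.str.replace('...', 'a', regex=False))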
I have a data file with fields separated by commas that I received from someone. I have to systematically go through each column to understand things like the usual descriptive statistics:
-Min
-Max
-Mean
-25th percentile
-50th percentile
-75th percentile
or if it's text:
-number of distinct values
but I also need to find:
-number of null or missing values
-number of zeroes
Sometimes the oddities of a feature mean something, i.e. they contain information. And I might need to circle back with the client about oddities I find. Or, if I'm going to replace values, I have to make sure I'm not steamrolling over something recklessly.
So my question is this: Is there a package in python that will find this for me without my presupposing the data type? And if it did exist, would pandas be a good home for it?
I see that pandas makes it easy peasy to replace values, but to begin with I just want to look.
You can use the describe method:
In [1]: df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
In [2]: df
Out[2]:
A B C
0 1.389738 -0.205485 -0.775810
1 -1.166596 -0.898761 -1.805333
2 -1.016509 -0.816037 0.169265
3 -0.440860 -1.147164 1.558606
4 0.763012 1.068694 -0.711795
5 0.075961 -0.597715 0.699023
6 3.006095 -0.354879 -0.718440
7 -1.249588 -0.372235 1.611717
8 0.518770 -0.742766 1.956372
9 1.304080 -0.803262 -0.609970
In [3]: df.describe()
Out[3]:
A B C
count 10.000000 10.000000 10.000000
mean 0.318410 -0.486961 0.137363
std 1.360633 0.616566 1.266616
min -1.249588 -1.147164 -1.805333
25% -0.872596 -0.812843 -0.716779
50% 0.297366 -0.670240 -0.220352
75% 1.168813 -0.359218 1.343710
max 3.006095 1.068694 1.956372
It also takes a percentiles argument to control which percentiles are reported (older pandas versions used a percentile_width argument that defaulted to 50).
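describe() does not report missing values or zeroes directly; here is a small sketch of how those counts could be pulled with plain pandas (the column names are made up):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'amount': [0.0, 1.5, np.nan, 3.2, 0.0],
    'city':   ['Oslo', 'Oslo', None, 'Bergen', 'Lima'],
})

print(df.describe(include='all'))  # numeric stats plus count/unique/top/freq for text
print(df.isna().sum())             # number of null / missing values per column
print((df == 0).sum())             # number of zeroes per column
print(df.nunique())                # number of distinct (non-null) values per column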