I want the following to happen:
for every column in df, check whether its type is numeric; if not, use a label encoder to map str/object values to numeric classes (e.g. 0, 1, 2, 3...).
I am trying to do it in the following way:
for col in df:
    if not np.issubdtype(df[col].dtype, np.number):
        df[col] = LabelEncoder().fit_transform(df[col])
I see a few problems here.
First, column names can repeat, so df[col] can return more than one column, which is not what I want.
Second - df[col].dtype throws error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume arises from issue #1, i.e. multiple columns being returned, so df[col] is a DataFrame rather than a Series. But I am not confident.
Third, would assigning df[col] = LabelEncoder().fit_transform(df[col]) replace the column in df, or do I need some esoteric df partitioning and concatenation?
Thank you
Since LabelEncoder supports only one column at a time, iterating over the columns is your only option. You can make this a little more concise by using select_dtypes to select the non-numeric columns, and then df.apply to apply the LabelEncoder to each of them.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))
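As a quick end-to-end check, here is a minimal sketch on a made-up frame (column names and values are purely illustrative):

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({'color': ['red', 'blue', 'red'], 'size': [1, 2, 3]})

# Encode only the non-numeric columns, one at a time.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))

print(df)
#    color  size
# 0      1     1
# 1      0     2
# 2      1     3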
Related
Usually, when you want to remove columns whose values are not of type float, you can write pd.DataFrame.select_dtypes(include='float64'). However, I would like to remove a column in cases where the header name is not a float:
df = pd.DataFrame({'a' : [1,2,3], 10 : [3,4,5]})
df.dtypes
will give the output
a int64
10 int64
dtype: object
How can I remove the column a, based on the fact that its header is not a float or int?
Try filtering the columns with a regex on the header. To keep only the digit headers (dropping a):
df.filter(regex=r'\d', axis=1)
# On the contrary, if you wanted to drop 10, keep the non-digit headers:
# df.filter(regex=r'\D', axis=1)
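Applied to the frame from the question, this should give the following (a quick sketch; filter() stringifies the labels before matching, so the integer header 10 matches \d):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 10: [3, 4, 5]})
print(df.filter(regex=r'\d', axis=1))
#    10
# 0   3
# 1   4
# 2   5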
A solution based on type enumeration:
Code
sr_dtype = df.dtypes
df = df.drop(columns=sr_dtype.index[
    sr_dtype.index.map(lambda el: not isinstance(el, (int, float)))  # add more if necessary
])
Note that df.dtypes itself is a Series instance, so regular Series operations are applicable. In particular, index.map() is used here as a wrapper for the isinstance() check.
Result
print(df)
   10
0   3
1   4
2   5
Are you sure that's the right output? Your dataframe columns are 'a' and 10; why does your input have a column named 'b'?
Anyway, to remove the column a, regardless of its type but through its header name, use the drop method:
df = df.drop(columns=['a'])
This also works with a list of several columns, instead of the single-element list used here.
Building on the other answers, you can also try:
1) To make sure you keep only float and int header types:
df[[col for col in df.columns if type(col) in [float,int]]]
2) To just exclude string-like columns:
df.loc[:, [not isinstance(col, str) for col in df.columns]] # return bool array
# or
df[[col for col in df.columns if not isinstance(col, str)]] # return column names
3) To exclude headers that are not float/int, based on a regex:
df.filter(regex=r'^\d+$|^\d+?\.{1}\d+$')
where the first expression ^\d+$ matches integers (start and end with a digit), and the second expression ^\d+?\.{1}\d+$ matches floats. We could use just ^[\d|\.]+$ (allowing only digits and points) to match both of them, but it would also match columns like "1..2".
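As a quick illustration of option 3 (a sketch with made-up headers):

import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=[10, 2.5, 'a'])
# Labels are stringified before matching: '10' and '2.5' match, 'a' does not.
print(df.filter(regex=r'^\d+$|^\d+?\.{1}\d+$').columns.tolist())
# [10, 2.5]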
I'm trying to pre-process some data for machine learning purposes. I'm currently trying to clean up some NaN values and replace them with 'unknown' plus a prefix or suffix based on the column name.
The reason for this is that when I use one-hot encoding, I can't have multiple columns with the same name being fed into xgboost.
So what I have is the following:
df = df.apply(lambda x: x.replace(np.nan, 'unknown'))
And I'd like to replace all instances of NaN in the df with 'unknown_columname'. Is there any easy or simple way to do this?
Try df = df.apply(lambda x: x.replace(np.nan, f'unknown_{x.name}')).
You can also use df = df.apply(lambda x: x.fillna(f'unknown_{x.name}')).
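For example, a minimal sketch with a hypothetical two-column frame:

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['NY', np.nan], 'job': [np.nan, 'dev']})
df = df.apply(lambda x: x.fillna(f'unknown_{x.name}'))
print(df.values.tolist())
# [['NY', 'unknown_job'], ['unknown_city', 'dev']]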
First, let's create the backup array of fill values, one per column:
s = np.core.defchararray.add('unknown_', df.columns.values.astype(str))
Then we can simply replace each NaN in a column with the matching value from s:
for i, col_name in enumerate(df.columns):
    df[col_name] = df[col_name].fillna(s[i])
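A quick check of what s contains, on the same hypothetical frame as above:

import numpy as np
import pandas as pd

df = pd.DataFrame({'city': ['NY', np.nan], 'job': [np.nan, 'dev']})
s = np.core.defchararray.add('unknown_', df.columns.values.astype(str))
print(s)  # ['unknown_city' 'unknown_job']

for i, col_name in enumerate(df.columns):
    df[col_name] = df[col_name].fillna(s[i])
print(df.loc[1, 'city'])  # unknown_city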
I have this code, which works for one pandas Series. How can I apply it to all columns of my large dataset? I have tried many solutions, but none works for me.
c = data["High_banks"]
c2 = pd.to_numeric(c.str.replace(',',''))
data = data.assign(High_banks = c2)
What is the best way to do this?
I think you can do it like this:
df = df.replace(',', '', regex=True)
After that, you can convert the datatype.
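For instance, a sketch assuming every remaining column should end up numeric:

import pandas as pd

df = pd.DataFrame({'High_banks': ['1,234', '5,678']})
df = df.replace(',', '', regex=True)  # strip the thousands separators
df = df.apply(pd.to_numeric)          # then convert each column

print(df.dtypes)  # High_banks    int64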
You can use a combination of the methods apply and applymap.
Take this for an example:
df = pd.DataFrame([['1,', '2,12'], ['3,356', '4,567']], columns=['a', 'b'])
new_df = (df.applymap(lambda x: x.replace(',', ''))
            .apply(pd.to_numeric, axis=1))
new_df.dtypes
>> #successfully converted to numeric types
a int64
b int64
dtype: object
The first method, applymap, runs element-wise on the dataframe to remove the commas; then apply applies the pd.to_numeric function to each row (axis=1 here; the default column-wise axis=0 would work equally well).
This is a follow up question to the one asked here.
I have about 200 column names in a dataframe which need to be converted to datetime format.
My initial thought was to create a list of column names, iterate through the list converting them as I go, and then rename the columns of the dataframe using this list of converted names. But from the previous question, I am not sure whether I can apply to_datetime to a regular string element, so this method won't work.
Is there anyway to easily convert all columns, or at least, selected columns, with to_datetime?
I do not see an axis to choose in the documentation:
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=None, box=True, format=None, exact=True, unit=None, infer_datetime_format=False, origin='unix')
The to_datetime function works only with a Series (a column of a DataFrame), so possible solutions are:
df = df.apply(pd.to_datetime)
#alternative
#df = df.apply(lambda x: pd.to_datetime(x))
Or:
for c in df.columns:
    df[c] = pd.to_datetime(df[c])
To convert the column names themselves:
df.columns = pd.to_datetime(df.columns)
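A quick sketch of the column-name conversion, with hypothetical date-like headers:

import pandas as pd

df = pd.DataFrame([[1, 2]], columns=['2021-01-01', '2021-01-02'])
df.columns = pd.to_datetime(df.columns)
print(df.columns)
# DatetimeIndex(['2021-01-01', '2021-01-02'], dtype='datetime64[ns]', freq=None)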
I have a hierarchical dataset:
df = pd.DataFrame(np.random.rand(6,6),
                  columns=[['A','A','A','B','B','B'],
                           ['mean', 'max', 'avg']*2],
                  index=pd.date_range('20000103', periods=6))
I want to apply a function to all values under the top-level column A. I can set the value to something:
df.loc[slice(None), 'A'] = 1
Easy enough. Now, instead of assigning a value, if I want to apply a mapping to this MultiIndex slice, it does not work.
For example, let me apply a simple formatting statement:
df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
This step works fine. However, I cannot assign this to the original df:
df.loc[slice(None), 'A'] = df.loc[slice(None), 'A'].applymap('{:.2f}'.format)
Everything turns into a NaN. Any help would be appreciated.
You can do it in a couple of ways:
df['A'] = df['A'].applymap('{:.2f}'.format)
or (this will keep the original dtype)
df['A'] = df['A'].round(2)
or as a string
df['A'] = df['A'].round(2).astype(str)
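Putting it together on the frame from the question (a sketch; the values are random, so the printed numbers will differ):

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 6),
                  columns=[['A', 'A', 'A', 'B', 'B', 'B'],
                           ['mean', 'max', 'avg'] * 2],
                  index=pd.date_range('20000103', periods=6))

# df['A'] selects every sub-column under the top-level key 'A',
# and assigning back through the same key keeps the MultiIndex intact.
df['A'] = df['A'].round(2)
print(df['A'].head(2))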