Need to strip CSV Column Number Data of Letters - Pandas - python

I am working on a .csv file in which the numerical data in some columns includes letters. I want to strip the letters so that the column can be a float or int.
I have tried the following:
a loop/function approach that strips the string data out of the object columns (here, the "MPG"-style column) and leaves only numerical values.
Step 1 should print the names of the columns where at least one entry ends in the characters 'mpg'.
CODING IN JUPYTER NOTEBOOK CELLS:
Step 1:
MPG_cols = []
# only want to consider the string columns
for colname in df.columns[df.dtypes == 'object']:
    # using .str so I can use an element-wise string method
    if df[colname].str.endswith('mpg').any():
        MPG_cols.append(colname)

print(MPG_cols)
THIS GIVES ME OUTPUT:
['Power']  # good so far
STEP 2:
# define the value to be removed using a loop
def remove_mpg(pow_val):
    """For each value, take the number before the 'mpg'
    unless it is not a string value. This will only happen
    for NaNs, so in that case we just return NaN.
    """
    if isinstance(pow_val, str):
        i = pow_val.replace('mpg', '')
        return float(pow_val.split(' ')[0])
    else:
        return np.nan

position_cols = ['Vehicle_type']

for colname in MPG_cols:
    df[colname] = df[colname].apply(remove_mpg)

df[MPG_cols].head()
The Error I get:
ValueError Traceback (most recent call last)
<ipython-input-37-45b7f6d40dea> in <module>
15
16 for colname in MPG_cols:
---> 17 df[colname] = df[colname].apply(remove_mpg)
18
19 df[MPG_cols].head()
~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
3846 else:
3847 values = self.astype(object).values
-> 3848 mapped = lib.map_infer(values, f, convert=convert_dtype)
3849
3850 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/lib.pyx in pandas._libs.lib.map_infer()
<ipython-input-37-45b7f6d40dea> in remove_mpg(pow_val)
8 if isinstance(pow_val, str):
9 i=pow_val.replace('mpg', '')
---> 10 return float(pow_val.split(' ')[0])
11 else:
12 return np.nan
ValueError: could not convert string to float: 'null'
I applied similar code to a different column and it worked on that column, but not here.
Any guidance will be greatly appreciated.
Bests,

I think you would need to revisit the logic of the remove_mpg function. One way you can tweak it is as follows:
import re
import numpy as np

def get_me_float(pow_val):
    my_numbers = re.findall(r"(\d+.*\d+)mpg", pow_val)
    if len(my_numbers) > 0:
        return float(my_numbers[0])
    else:
        return np.nan
For example, to test the function:
my_pow_val = ['34mpg', '34.6mpg', '0mpg', 'mpg', 'anything']
for each_pow in my_pow_val:
    print(get_me_float(each_pow))
output:
34.0
34.6
nan
nan
nan
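A usage sketch against the question's DataFrame (assuming the df and MPG_cols from the question; the isinstance guard is my addition, since the real columns contain NaN floats that re.findall cannot take):
def get_me_float_safe(pow_val):
    # NaNs arrive as floats, so only parse actual strings
    if not isinstance(pow_val, str):
        return np.nan
    return get_me_float(pow_val)

for colname in MPG_cols:
    df[colname] = df[colname].apply(get_me_float_safe)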

This will work:
import pandas as pd
pd.to_numeric(pd.Series(['$2', '3#', '1mpg']).str.replace('[^0-9]', '', regex=True))
0 2
1 3
2 1
dtype: int64
For a complete solution:
for i in range(df.shape[1]):
    if df.iloc[:, i].dtype == 'object':
        df.iloc[:, i] = pd.to_numeric(df.iloc[:, i].str.replace('[^0-9]', '', regex=True))

df.dtypes
To select columns that should not be changed:
for i in range(df.shape[1]):
    # 'colA', 'colB' are columns which should remain the same
    if (df.iloc[:, i].dtype == 'object') and (df.columns[i] not in ['colA', 'colB']):
        df.iloc[:, i] = pd.to_numeric(df.iloc[:, i].str.replace('[^0-9]', '', regex=True))

df.dtypes
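One caveat worth noting: the pattern '[^0-9]' also strips decimal points and minus signs, so '34.6mpg' would become 346. A sketch that preserves them (my assumption about the data, not part of the original answer):
pd.to_numeric(pd.Series(['34.6mpg', '-2 mpg']).str.replace('[^0-9.-]', '', regex=True))
# 0    34.6
# 1    -2.0
# dtype: float64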

Why don't you use the converters parameter of the read_csv function to strip the extra characters when you load the CSV file?
def strip_mpg(s):
    return float(s.rstrip(' mpg'))

df = pd.read_csv(..., converters={'Power': strip_mpg}, ...)
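Given the literal 'null' strings that triggered the ValueError in the question, a more defensive converter might look like this (a sketch; the assumption is that anything unparsable should become NaN):
import numpy as np

def strip_mpg_safe(s):
    try:
        return float(s.rstrip(' mpg'))
    except (ValueError, AttributeError):
        return np.nan  # e.g. the literal string 'null'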

Related

Pandas Sort File and group up values

I'm learning pandas, but I'm having some trouble.
I import the data as a DataFrame and want to bin the 2017 population values into four equal-size groups, and then count the number of rows in group 4.
However, the system prints out:
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-52-05d9f2e7ffc8> in <module>
2
3 df=pd.read_excel('C:/Users/Sam/Desktop/商業分析/Python_Jabbia1e/Chapter 2/jaggia_ba_1e_ch02_Data_Files.xlsx',sheet_name='Population')
----> 4 df=df.sort_values('2017',ascending=True)
5 df['Group'] = pd.qcut(df['2017'], q = 4, labels = range(1, 5))
6 splitData = [group for _, group in df.groupby('Group')]
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in sort_values(self, by, axis, ascending, inplace, kind, na_position, ignore_index, key)
5453
5454 by = by[0]
-> 5455 k = self._get_label_or_level_values(by, axis=axis)
5456
5457 # need to rewrap column in Series to apply key function
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in _get_label_or_level_values(self, key, axis)
1682 values = self.axes[axis].get_level_values(key)._values
1683 else:
-> 1684 raise KeyError(key)
1685
1686 # Check for duplicates
KeyError: '2017'
What's wrong with it?
Thanks~
Here's the dataframe:
And I tried:
df=pd.read_excel('C:/Users/Sam/Desktop/商業分析/Python_Jabbia1e/Chapter 2/jaggia_ba_1e_ch02_Data_Files.xlsx',sheet_name='Population')
df=df.sort_values('2017',ascending=True)
df['Group'] = pd.qcut(df['2017'], q = 4, labels = range(1, 5))
splitData = [group for _, group in df.groupby('Group')]
print('The number of group4 is :',splitData[3].shape[0])
You are passing the key to df.sort_values() as a str. You can give it either as an element in a list or on its own:
df = df.sort_values(by=['2017'], ascending=True)
or
df = df.sort_values(by='2017', ascending=True)
This only works if the column name exactly matches the string you pass. If the column name is not a string, or if it contains surrounding whitespace, it won't work. You can remove any surrounding whitespace before sorting with:
df.columns = df.columns.str.strip()
and if the column name is not a string you should use:
df = df.sort_values(by=[2017], ascending=True)
Firstly, you have a problem in line 4 with the sort: you tell the sort function to look for the string '2017', but the column label is an integer. Try this, then move on with your code:
df=df.sort_values([2017],ascending=True)
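A quick way to check which form the labels actually take (a sketch; it just prints the column labels and their types):
print(df.columns.tolist())
print([type(c) for c in df.columns])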

Converting dataframe column of mixed types to int, ignore values with non numeric characters

df:
IDs
0 text
1 001
2 1
df = pd.DataFrame({'IDs': ['text', '001', '1']})
And I'd like to convert the values to int where possible, so that strings corresponding to the same entity, 001 and 1, become identical values by dropping the '00' prefix.
This is demonstrated in the pandas documentation, but neither df['IDs'] = pd.to_numeric(df['IDs'], errors='ignore') nor df['IDs'] = df['IDs'].astype(int, errors='ignore') is changing anything.
What am I doing wrong?
This is expected; the to_numeric docs say:
If ‘ignore’, then invalid parsing will return the input.
This means that if even one value is invalid, the original values are returned unchanged.
A possible solution is to use a custom function with try-except:
df = pd.DataFrame({'IDs': ['text', '001', '1']})

def func(x):
    try:
        return int(x)
    except (ValueError, TypeError):
        return x

df['IDs'] = df['IDs'].apply(func)
print(df)
IDs
0 text
1 1
2 1
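An alternative sketch without apply (my addition, not from the original answer): coerce everything to numbers, then fall back to the original value wherever parsing failed. Note that the parsed values come back as floats:
nums = pd.to_numeric(df['IDs'], errors='coerce')
df['IDs'] = nums.fillna(df['IDs'])  # 'text' stays, '001' and '1' both become 1.0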

label-encoder encoding a dataframe without encoding NaN missing values

I have a dataframe that contains Numerical, categorical and NaN values.
customer_class B C
0 OM1 1 2.0
1 NaN 6 1.0
2 OM1 9 NaN
....
I need a LabelEncoder that keeps my missing values as 'NaN' so I can use an Imputer afterwards.
So I would like to use the following code to encode my dataframe while keeping the NaN values.
Here is the code:
class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # List of column names in the DataFrame that should be encoded
        self.col = col
        # Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self, x, y=None):
        # Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el] != 'NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self, x, y=None):
        # Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el] != 'NaN']
            # Store an ndarray of the current column
            b = x[el].get_values()
            # Replace the elements in the ndarray that are not 'NaN'
            # using the transformer
            b[b != 'NaN'] = self.le_dic[el].transform(a)
            # Overwrite the column in the DataFrame
            x[el] = b
            # return the transformed D

col = data1['customer_class']
LabelEncoderByCol(col)
LabelEncoderByCol.fit(x=col, y=None)
But I got this error :
846 if mask.any():
--> 847 raise ValueError('%s not contained in the index' % str(key[mask]))
848 self._set_values(indexer, value)
849
ValueError: ['OM1' 'OM1' 'OM1' ... 'other' 'EU' 'EUB'] not contained in the index
Any idea how to resolve this error?
Thanks
Two things jumped out to me when I tried to reproduce:
Your code seems to expect a dataframe will be passed to your class. But in your example you passed a series. I fixed this by wrapping the series as a dataframe before passing it to your class: col = pd.DataFrame(data1['customer_class']).
In your class' __init__ method it seemed like you had intended to iterate through a list of column names, but instead were actually iterating through all of your columns, series by series. I fixed this by changing the appropriate line to: self.col = col.columns.values.
Below, I've pasted in my modifications to your class' __init__ and fit methods (my only modification to the transform method was to have it return the modified dataframe):
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

data1 = pd.DataFrame({'customer_class': ['OM1', np.nan, 'OM1'],
                      'B': [1, 6, 9],
                      'C': [2.0, 1.0, np.nan]})

class LabelEncoderByCol(BaseEstimator, TransformerMixin):
    def __init__(self, col):
        # List of column names in the DataFrame that should be encoded
        self.col = col.columns.values
        # Dictionary storing a LabelEncoder for each column
        self.le_dic = {}
        for el in self.col:
            self.le_dic[el] = LabelEncoder()

    def fit(self, x, y=None):
        # Fill missing values with the string 'NaN'
        x = x.fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el] != 'NaN']
            self.le_dic[el].fit(a)
        return self

    def transform(self, x, y=None):
        # Fill missing values with the string 'NaN'
        x[self.col] = x[self.col].fillna('NaN')
        for el in self.col:
            # Only use the values that are not 'NaN' to fit the Encoder
            a = x[el][x[el] != 'NaN']
            # Store an ndarray of the current column
            b = x[el].get_values()
            # Replace the elements in the ndarray that are not 'NaN'
            # using the transformer
            b[b != 'NaN'] = self.le_dic[el].transform(a)
            # Overwrite the column in the DataFrame
            x[el] = b
        return x
I am able to run the following lines (also slightly modified from your initial implementation) with no error:
col = pd.DataFrame(data1['customer_class'])
lenc = LabelEncoderByCol(col)
lenc.fit(x=col,y=None)
I can then access the classes for the customer_class column from your example:
lenc.fit(x=col,y=None).le_dic['customer_class'].classes_
Which outputs:
array(['OM1'], dtype=object)
Finally, I can transform the column using your class' transform method:
lenc.transform(x=col,y=None)
Which outputs the following:
customer_class
0 0
1 NaN
2 0
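One hedged update for newer pandas: Series.get_values() was removed in pandas 1.0, so on recent versions the line that extracts the ndarray would instead be:
b = x[el].to_numpy()  # replacement for the removed get_values()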

Pandas Error Matching String

I have data like the SampleDf data below. I'm trying to check the values in one column of my dataframe to see if they contain 'sum', 'count', or 'Avg', and then create a new column holding the value 'sum', 'count', or 'Avg'. When I run the code below on my real dataframe I get the error below. When I run dtypes on my real dataframe it says all the columns are objects. The code below is related to the post linked below. Unfortunately I don't get the same error when I run the code on the SampleDf I've provided, but I couldn't post my whole dataframe.
post:
Pandas and apply function to match a string
Code:
SampleDf = pd.DataFrame([['tom', "Avg(case when Value1 in ('Value2') and [DateType] in ('Value3') then LOS end)"],
                         ['bob', "isnull(Avg(case when XferToValue2 in (1) and DateType in ('Value3') and [Value1] in ('HM') then LOS end),0)"]],
                        columns=['ReportField', 'OtherField'])

search1 = 'Sum'
search2 = 'Count'
search3 = 'Avg'

def Agg_type(x):
    if search1 in x:
        return 'sum'
    elif search2 in x:
        return 'count'
    elif search3 in x:
        return 'Avg'
    else:
        return 'Other'

SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
SampleDf.head()
Error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-17-a2b4920246a7> in <module>()
17 return 'Other'
18
---> 19 SampleDf['AggType'] = SampleDf['OtherField'].apply(Agg_type)
20
21 #SampleDf.head()
C:\Users\Name\AppData\Local\Continuum\Anaconda3\lib\site-packages\pandas\core\series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas\src\inference.pyx in pandas.lib.map_infer (pandas\lib.c:66124)()
<ipython-input-17-a2b4920246a7> in Agg_type(x)
8
9 def Agg_type(x):
---> 10 if search1 in x:
11 return 'sum'
12 elif search2 in x:
TypeError: argument of type 'float' is not iterable
You can try this:
SampleDf['new_col'] = np.where(SampleDf.OtherField.str.contains("Avg"),"Avg",
np.where(SampleDf.OtherField.str.contains("Count"),"Count",
np.where(SampleDf.OtherField.str.contains("Sum"),"Sum","Nothing")))
Please notice that this will work properly only if you don't have both 'Avg' and 'Count' (or 'Sum') in the same string.
If you do, please let me know and I'll look for a better approach.
Of course, if something else doesn't suit your needs, also report it back.
Hope this was helpful.
Explanation:
What's happening is that you look for the rows where 'Avg' appears in the string in the OtherField column and fill new_col with 'Avg' at those rows. For the remaining rows (where there isn't 'Avg'), you look for 'Count' and do the same, and lastly you do the same for 'Sum'.
documentation:
np.where
pandas.series.str.contains
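For more than a couple of conditions, a flatter sketch of the same idea uses np.select (my addition, assuming the same SampleDf; na=False guards against the NaN floats that caused the original TypeError):
import numpy as np

conditions = [SampleDf.OtherField.str.contains("Avg", na=False),
              SampleDf.OtherField.str.contains("Count", na=False),
              SampleDf.OtherField.str.contains("Sum", na=False)]
SampleDf['AggType'] = np.select(conditions, ["Avg", "Count", "Sum"], default="Other")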

Pythonic/efficient way to strip whitespace from every Pandas Data frame cell that has a stringlike object in it

I'm reading a CSV file into a DataFrame. I need to strip whitespace from all the stringlike cells, leaving the other cells unchanged in Python 2.7.
Here is what I'm doing:
def remove_whitespace(x):
    if isinstance(x, basestring):
        return x.strip()
    else:
        return x

my_data = my_data.applymap(remove_whitespace)
Is there a better or more idiomatic way to do this in Pandas?
Is there a more efficient way (perhaps by doing things column wise)?
I've tried searching for a definitive answer, but most questions on this topic seem to be how to strip whitespace from the column names themselves, or presume the cells are all strings.
Stumbled onto this question while looking for a quick and minimalistic snippet I could use. I had to assemble one myself from the posts above. Maybe someone will find it useful:
data_frame_trimmed = data_frame.apply(lambda x: x.str.strip() if x.dtype == "object" else x)
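One hedged caveat about this one-liner: on an object column that mixes strings with non-strings, Series.str.strip() returns NaN for the non-string elements, so those cells get silently blanked out. A tiny demonstration (my own example values):
s = pd.Series([' a ', 42], dtype=object)
print(s.str.strip())  # 0      a
                      # 1    NaN   <- the int 42 became NaN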
You could use pandas' Series.str.strip() method to do this quickly for each string-like column:
>>> data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
>>> data
values
0 ABC
1 DEF
2 GHI
>>> data['values'].str.strip()
0 ABC
1 DEF
2 GHI
Name: values, dtype: object
We want to:
Apply our function to each element in our dataframe - use applymap.
Use type(x)==str (versus x.dtype == 'object') because Pandas will label columns as object for columns of mixed datatypes (an object column may contain int and/or str).
Maintain the datatype of each element (we don't want to convert everything to a str and then strip whitespace).
Therefore, I've found the following to be the easiest:
df.applymap(lambda x: x.strip() if type(x)==str else x)
When you call pandas.read_csv, you can use a regular expression that matches zero or more spaces followed by a comma followed by zero or more spaces as the delimiter.
For example, here's "data.csv":
In [19]: !cat data.csv
1.5, aaa, bbb , ddd , 10 , XXX
2.5, eee, fff , ggg, 20 , YYY
(The first line ends with three spaces after XXX, while the second line ends at the last Y.)
The following uses pandas.read_csv() to read the files, with the regular expression ' *, *' as the delimiter. (Using a regular expression as the delimiter is only available in the "python" engine of read_csv().)
In [20]: import pandas as pd
In [21]: df = pd.read_csv('data.csv', header=None, delimiter=' *, *', engine='python')
In [22]: df
Out[22]:
0 1 2 3 4 5
0 1.5 aaa bbb ddd 10 XXX
1 2.5 eee fff ggg 20 YYY
The "data['values'].str.strip()" answer above did not work for me, but I found a simple work around. I am sure there is a better way to do this. The str.strip() function works on Series. Thus, I converted the dataframe column into a Series, stripped the whitespace, replaced the converted column back into the dataframe. Below is the example code.
import pandas as pd
data = pd.DataFrame({'values': [' ABC ', ' DEF', ' GHI ']})
print ('-----')
print (data)
data['values'].str.strip()
print ('-----')
print (data)
new = pd.Series([])
new = data['values'].str.strip()
data['values'] = new
print ('-----')
print (new)
Here is a column-wise solution with pandas apply:
import numpy as np

def strip_obj(col):
    if col.dtypes == object:
        return (col.astype(str)
                   .str.strip()
                   .replace({'nan': np.nan}))
    return col

df = df.apply(strip_obj, axis=0)
This will convert the values in object-type columns to strings. Take caution with mixed-type columns: for example, if your column holds zip codes as the int 20001 and the string ' 21110 ', you will end up with the strings '20001' and '21110'.
This worked for me - applies it to the whole dataframe:
def panda_strip(x):
    r = []
    for y in x:
        if isinstance(y, str):
            y = y.strip()
        r.append(y)
    return pd.Series(r)

df = df.apply(lambda x: panda_strip(x))
I found the following code useful and something that would likely help others. This snippet will allow you to delete spaces in a column as well as in the entire DataFrame, depending on your use case.
import pandas as pd
def remove_whitespace(x):
    try:
        # remove spaces inside and outside of the string
        x = "".join(x.split())
    except AttributeError:
        # non-string values (e.g. NaN, ints) are returned unchanged
        pass
    return x

# Apply remove_whitespace to a single column only
df.orderId = df.orderId.apply(remove_whitespace)
print(df)

# Apply remove_whitespace to the entire DataFrame
df = df.applymap(remove_whitespace)
print(df)
